a backreference problem?

Geoff Cox

Hello,

I can use

$string =~ /="(.*)"\.doc/;
print $1;

which will get "docs/path/word"
from <a href="docs/path/word.doc">link to word doc</a> (A)

But! What if I have a file with say 100 lines similar to A above? How
do I deal with multiple values of $1?

Cheers

Geoff
 
Tad McClellan

Geoff Cox said:
I can use

$string =~ /="(.*)"\.doc/;
print $1;


Yes, but you shouldn't.

You should never use the dollar-digit variables unless you
have first ensured that the match _succeeded_.


if ( $string =~ /="(.*)"\.doc/ )
{ print $1 }
 
Tad McClellan

$string =~ /="(.*)"\.doc/;
                  ^^^
                  ^^^
which will get "docs/path/word"
      ^^^^^^^^
      ^^^^^^^^ No it won't.

from <a href="docs/path/word.doc">link to word doc</a> (A)


Your pattern requires a double quote before a dot.

The string does not contain a double quote before a dot.

The match must fail, and $1 will *not* be set, it will be left
with the same value that it had before the match was attempted.
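
To make the stale-$1 behaviour concrete, here is a small sketch. The sample strings and variable names are made up for illustration; only the two patterns come from this thread.

```perl
use strict;
use warnings;

my $link = '<a href="docs/path/word.doc">link to word doc</a>';

# A successful match sets $1.
"first=abc" =~ /=(\w+)/ or die "unexpected: first match failed";
my $before = $1;                      # "abc"

# The original pattern wants a double quote immediately before ".doc",
# which the link does not contain, so this match fails.
my $matched = ( $link =~ /="(.*)"\.doc/ ) ? 1 : 0;
my $after   = $1;                     # still holds the value from the earlier match

print "matched: $matched, \$1 before: $before, \$1 after: $after\n";

# Moving the quote after ".doc" and testing the result avoids the problem.
my $path;
if ( $link =~ /="(.*?)\.doc"/ ) {
    $path = $1;
    print "path: $path\n";
}
```

The failed match leaves $1 holding "abc", which is why printing $1 without testing the match can show a value that has nothing to do with the current string.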

But! What if I have a file with say 100 lines similar to A above? How
do I deal with multiple values of $1?


It depends on what "deal with" means when you say it.

The answer would probably involve one of Perl's looping constructs
and/or aggregate data types.

We would need a better question in order to give a better answer.

If "deal with" means "print dollar one" for instance, then the
answer would be "use a while(<FILE>) loop".
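
As a rough sketch of that loop, assuming "deal with" means printing or collecting each captured path: the in-memory filehandle and the sample lines below are stand-ins for the real file, and the corrected pattern (quote after ".doc") is used.

```perl
use strict;
use warnings;

# Stand-in for the real file: an in-memory filehandle over sample lines.
my $sample = <<'END';
<a href="docs/path/word.doc">link to word doc</a>
<p>no link here</p>
<a href="docs/path/word2.doc">link to word doc</a>
END

open( my $fh, '<', \$sample ) or die "could not open in-memory file: $!";

my @paths;
while ( my $line = <$fh> ) {
    # Only touch $1 when the match succeeded on this line.
    if ( $line =~ /href="(.*?)\.doc"/ ) {
        print "$1\n";
        push @paths, $1;
    }
}
close($fh);
```

Each pass through the loop gets its own value of $1, so there is never more than one value to handle at a time; pushing onto @paths keeps them all if they are needed later.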
 
Peter Cooper

Geoff Cox said:
which will get "docs/path/word"
from <a href="docs/path/word.doc">link to word doc</a> (A)

But! What if I have a file with say 100 lines similar to A above? How
do I deal with multiple values of $1?

You can match 'many' things into an array like so:

my $data1 = q{
<a href="docs/path/word.doc">link to word doc</a>
<a href="docs/path/word2.doc">link to word doc</a>
<a href="docs/path/word3.doc">link to word doc</a>
};

(@names) = ($data1 =~ /="(.*?)\.doc"/gsi);
print $_ . "\n" for @names;

However, if you really want to parse HTML, and aren't just using HTML as an
example here, you will want to look into modules which are dedicated to this
purpose. Look at the HTML Parser set at
http://search.cpan.org/author/GAAS/HTML-Parser-3.31/ . HTML::LinkExtor (a
link extractor) may be of particular use to you.

Regards,
Peter Cooper
 
Geoff Cox

On Sun, 24 Aug 2003 00:05:44 +0100, "Peter Cooper"

Peter et al ...

Now trying this - you will perhaps see better what I am trying to
do...problem with the passing of $1 to the sub getintro - I get an
uninitialized value in pattern match error ...

Cheers

Geoff

open(IN, "a2-left.htm");
open(OUT, ">>out");
open(INN, "total");

if (open(IN, "a2-left.htm")) {

$line = <IN>;

while ($line ne "") {
if ($line =~ /^<a href/) {
if ($line =~ /="(.*)\.doc/) {
&getintro($1);
}
}
$line = <IN>;

}
}
sub getintro {

@intro = <INN>;
for ($n=0;$n<900;$n++) {
if ($into[$n] =~ /$1/) {
print OUT ("$into[$n]\n");
print OUT ("$line[$n-1]\n");
}
}
}

close (IN);
close (OUT);
close (INN);
 
Geoff Cox

On Sun, 24 Aug 2003 09:41:55 +0100, Geoff Cox


I know there are 2 mistakes re $into where it should read $intro etc .
have corrected these but still get same error message....

Geoff
 
Geoff Cox

which is odd...the value for $1 does get into the sub getintro but get
the error message "uninitialized value in pattern match" for the line

if ($into[$n] =~ /$1/) {

have improved code by using strict but still get above error message?!

use strict;

open(IN, "a2-left.htm");
open(OUT, ">>out");
open(INN, "total");


my $line = <IN>;

while ($line ne "") {
if ($line =~ /^<a href/) {
if ($line =~ /="(.*)\.doc/) {
&getintro($1);
}
}
$line = <IN>;
}

sub getintro {
my $n;
my @intro = <INN>;
for ($n=0;$n<900;$n++) {
if ($intro[$n] =~ /$1/) {
print OUT ("$intro[$n]\n");
print OUT ("$intro[$n-1]\n");
}
}
}
close (IN);
close (OUT);
close (INN);
 
James E Keenan

Geoff Cox said:
On Sun, 24 Aug 2003 00:05:44 +0100, "Peter Cooper"

Peter et al ...

Now trying this - you will perhaps see better what I am trying to
do...problem with the passing of $1 to the sub getintro - I get an
uninitialized value in pattern match error ...

Cheers

Geoff

open(IN, "a2-left.htm");
open(OUT, ">>out");
open(INN, "total");

if (open(IN, "a2-left.htm")) {

Why are you asking to do something if and only if the filehandle is open?
You opened it 3 lines above.
$line = <IN>;

while ($line ne "") {

better for 2 above lines:

if ($line =~ /^<a href/) {

Right here it becomes apparent that you're trying to parse HTML -- which
means you should heed Peter's advice to check out HTML::Parser.
if ($line =~ /="(.*)\.doc/) {
&getintro($1);
}
}
$line = <IN>;
What's the purpose of the line above?
}
}
sub getintro {

@intro = <INN>;

You don't appear to do anything with the content of @intro, so why read from
for ($n=0;$n<900;$n++) {
if ($into[$n] =~ /$1/) {

... unless, that is, you have a typo in line above and meant $intro

But here $1 contains the result of the first captured expression on the last
matching line ... which may not always be what you want.
print OUT ("$into[$n]\n");
print OUT ("$line[$n-1]\n");
}
}
}

close (IN);
close (OUT);
close (INN);

Note: The subject of your OP was "backreference problem." But at no point
in the discussion have you used any backreferences (e.g., \1 as part of a
pattern match). This leads me to suspect that you just don't understand
Perl regexes very well. I recommend going to a good Perl text (e.g., the
llama) and carefully working through the exercises on regexes.
 
Tad McClellan

Have you seen the Posting Guidelines that are posted here frequently?

any ideas?


Indent your code for human readability if you want humans to read it.

Many people will not take the time to read your code because you
did not take the time to make it easy for them to read your code.

open(IN, "a2-left.htm");


You should always, yes *always*, check the return value from open().

You were doing that, but now you've taken it back out.

open(IN, 'a2-left.htm') or die "could not open 'a2-left.htm' $!";

sub getintro {

my $n;

print ("$1\n");

my @intro = <INN>;
for ($n=0;$n<900;$n++) {


foreach my $n ( 0 .. 899 ) { # does the same thing

if ($intro[$n] =~ /$1/) {
&print;
}
}

sub print {
print OUT ("$intro[$n]\n");
^^
^^ $n is undefined



[snip TOFU, please do not do that anymore]
 
Geoff Cox

On 24 Aug 2003 12:35:34 GMT, "James E Keenan" wrote:

James,

Apologies for calling you John!

Geoff
 
Geoff Cox

James,

following code nearly there ... just one major problem ----- I would
like to have the text from the getintro to be in the order in which
the path is obtained from the a2-left.htm file but it is different
here ...From memory I think the problem is that

@intro = <INN>;

is in random order? is there a way round this?

Cheers

Geoff




use strict;

open(IN, "a2-left.htm");
open(OUT, ">>out");
open(INN, "total");

my $line = <IN>;

while ($line ne "") {

if ($line =~ /^<a href/) {

if ($line =~ /="(.*)\.doc/) {
my $found = $1;
&getintro($found);
}

}

$line =<IN>;
}



sub getintro {
my $found;
my $n;

my @intro = <INN>;
for ($n=0;$n<900;$n++) {
if ($intro[$n] =~ /^<a href/) {
if ($intro[$n] =~ /$found/) {
&print;
}
}

}

sub print {

print OUT ("<tr>$intro[$n-1]\n");
print OUT ("$intro[$n]</tr>\n");
}

}


close (IN);
close (OUT);
close (INN);
 
Geoff Cox

Tad,

the code below now does what I want - ie for each path to a Word doc
name in a2-left.htm it finds the same path etc in the file total and
gets the introductory text associated with this doc....

I am sure there are better ways of doing this...any thoughts? The sub
getintro seems poor..by the way it seems important to open and close
the total file each time the sub getintro is used...

Cheers

Geoff

use strict;

open(IN, "a2-left.htm");
open(OUT, ">>out");

my $line = <IN>;

while ($line ne "") {
if ($line =~ /^<a href/) {
if ($line =~ /href="(.*)\.doc/) {
&getintro($1);
}
}
$line =<IN>;
}


sub getintro {
open (INN, "total");
my $file = $1;
my $n;
my @intro = <INN>;

for ($n=0;$n<900;$n++) {
if ($intro[$n] =~ /$file/i) {
print OUT ("<tr>$intro[$n-1]\n");
print OUT ("$intro[$n]</tr>\n");
}
}
close (INN);
}


close (IN);
close (OUT);
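
On the point about reopening the total file on every call: a likely explanation is that reading a filehandle in list context consumes it to end of file, so a second read on the same handle returns an empty list. The lines themselves come back in file order, not random order. A minimal sketch, using an in-memory filehandle and made-up contents in place of the real total file:

```perl
use strict;
use warnings;

my $total = "intro one\nintro two\nintro three\n";   # stand-in for the 'total' file
open( my $inn, '<', \$total ) or die "could not open in-memory file: $!";

my @first  = <$inn>;    # reads every line, in file order
my @second = <$inn>;    # handle is at end of file: empty list

seek( $inn, 0, 0 ) or die "could not rewind: $!";
my @third  = <$inn>;    # after rewinding, all lines are available again
close($inn);

printf "first read: %d lines, second: %d, after seek: %d\n",
    scalar @first, scalar @second, scalar @third;
```

Rewinding with seek is one way round it; reading the file into an array once, before the loop, and reusing that array in every call would avoid both the reopen and the rewind.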




&getintro($1);


Why are you passing an argument when the subroutine definition
never makes use of the argument that you passed?

sub getintro {


my( $file ) = @_;

my $n;

print ("$1\n");


print ("$file\n");

if ($intro[$n] =~ /$1/) {


if ($intro[$n] =~ /$file/) {



print OUT "$intro[$n]\n"



[snip TOFU]
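
Putting those suggestions together, one possible shape for the subroutine is sketched below. The sample @intro contents, the array-reference parameter and the returned list are assumptions made for illustration, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lines of the 'total' file.
my @intro = (
    "Introductory text for the first document\n",
    qq{<a href="docs/path/word.doc">word</a>\n},
    "Introductory text for the second document\n",
    qq{<a href="docs/path/word2.doc">word2</a>\n},
);

# Returns each matching line together with the line before it.
sub getintro {
    my ( $file, $lines ) = @_;    # copy the arguments; do not rely on $1 here
    my @found;
    for my $n ( 1 .. $#$lines ) {    # only indices that exist, starting at 1
        # \Q...\E makes any regex metacharacters in $file match literally;
        # the trailing \. stops "word" from also matching "word2.doc".
        if ( $lines->[$n] =~ /\Q$file\E\.doc/i ) {
            push @found, $lines->[ $n - 1 ], $lines->[$n];
        }
    }
    return @found;
}

my @hits = getintro( 'docs/path/word', \@intro );
print @hits;
```

Looping over the indices that actually exist, instead of a fixed 0 to 899, also avoids matching against undefined array elements, which is one way to trigger an "uninitialized value in pattern match" warning.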
 
Geoff Cox

Jay,

Just to thank you for your comments - I will read them tomorrow...a
little sleep required!

Cheers

Geoff
 
