a backreference problem?

Geoff Cox · Aug 23, 2003

Hello,

I can use

$string =~ /="(.*)"\.doc/;
print $1;

which will get "docs/path/word"
from <a href="docs/path/word.doc">link to word doc</a> (A)

But! What if I have a file with say 100 lines similar to A above? How
do I deal with multiple values of $1?

Cheers

Geoff

Tad McClellan · Aug 23, 2003

Geoff Cox said:
I can use

$string =~ /="(.*)"\.doc/;
print $1;

Yes, but you shouldn't.

You should never use the dollar-digit variables unless you
have first ensured that the match _succeeded_.

if ( $string =~ /="(.*)"\.doc/ )
{ print $1 }
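
A minimal sketch of why that check matters. The sample strings and variable names are made up for illustration, and the pattern puts the closing quote after ".doc" so that it can match line (A). After a failed match in the same scope, $1 still holds whatever the last successful match captured:

```perl
use strict;
use warnings;

# Assumed pattern: closing quote moved after ".doc" so sample line (A) matches.
my $re = qr/="(.*?)\.doc"/;

my $ok_first = '<a href="docs/path/word.doc">link</a>' =~ $re;
my $first    = $1;    # captured by the successful match

my $ok_second = '<p>no link here</p>' =~ $re;    # this match fails
my $stale     = $1;    # unchanged: left over from the first match

print "second match ", ( $ok_second ? "succeeded" : "failed" ),
    ", but \$1 is still '$stale'\n";
```

Printing $1 without testing the match result would silently report the old path for the second string.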

Tad McClellan · Aug 23, 2003

$string =~ /="(.*)"\.doc/;
                  ^^^
                  ^^^
which will get "docs/path/word"
      ^^^^^^^^
      ^^^^^^^^ No it won't.

from <a href="docs/path/word.doc">link to word doc</a> (A)

Your pattern requires a double quote before a dot.

The string does not contain a double quote before a dot.

The match must fail, and $1 will *not* be set, it will be left
with the same value that it had before the match was attempted.

But! What if I have a file with say 100 lines similar to A above? How
do I deal with multiple values of $1?

It depends on what "deal with" means when you say it.

The answer would probably involve one of Perl's looping constructs
and/or aggregate data types.

We would need a better question in order to give a better answer.

If "deal with" means "print dollar one" for instance, then the
answer would be "use a while(<FILE>) loop".
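
As a rough illustration of that suggestion, here is a sketch that reads line by line and prints $1 only when the match succeeds. It reads from an in-memory string instead of a real file so it is self-contained; the sample data, the corrected pattern, and the @paths array are assumptions, not part of the original posts:

```perl
use strict;
use warnings;

# Stand-in for the real input file (assumed sample data).
my $html = <<'END';
<a href="docs/path/word.doc">link to word doc</a>
<p>no link on this line</p>
<a href="docs/path/word2.doc">link to word doc</a>
END

open my $fh, '<', \$html or die "could not open in-memory file: $!";

my @paths;
while ( my $line = <$fh> ) {
    # Only touch $1 inside the branch where the match succeeded.
    if ( $line =~ /="(.*?)\.doc"/ ) {
        push @paths, $1;
        print "$1\n";
    }
}
close $fh;
```

Collecting the captures in an array as well as printing them covers the "aggregate data types" half of the answer.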

Peter Cooper · Aug 24, 2003

Geoff Cox said:
which will get "docs/path/word"
from <a href="docs/path/word.doc">link to word doc</a> (A)

But! What if I have a file with say 100 lines similar to A above? How
do I deal with multiple values of $1?

You can match 'many' things into an array like so:

my $data1 = q{
<a href="docs/path/word.doc">link to word doc</a>
<a href="docs/path/word2.doc">link to word doc</a>
<a href="docs/path/word3.doc">link to word doc</a>
};

my @names = ($data1 =~ /="(.*?)\.doc"/gsi);
print "$_\n" for @names;

However, if you really want to parse HTML, and aren't just using HTML as an
example here, you will want to look into modules which are dedicated to this
purpose. Look at the HTML Parser set at
http://search.cpan.org/author/GAAS/HTML-Parser-3.31/ . HTML::LinkExtor (a
link extractor) may be of particular use to you.

Regards,
Peter Cooper

Geoff Cox · Aug 24, 2003

On Sun, 24 Aug 2003 00:05:44 +0100, "Peter Cooper"

Peter et al ...

Now trying this - you will perhaps see better what I am trying to
do...problem with the passing of $1 to the sub getintro - I get an
uninitialized value in pattern match error ...

Cheers

Geoff

open(IN, "a2-left.htm");
open(OUT, ">>out");
open(INN, "total");

if (open(IN, "a2-left.htm")) {

$line = <IN>;

while ($line ne "") {
if ($line =~ /^<a href/) {
if ($line =~ /="(.*)\.doc/) {
&getintro($1);
}
}
$line = <IN>;

}
}
sub getintro {

@intro = <INN>;
for ($n=0;$n<900;$n++) {
if ($into[$n] =~ /$1/) {
print OUT ("$into[$n]\n");
print OUT ("$line[$n-1]\n");
}
}
}

close (IN);
close (OUT);
close (INN);

Geoff Cox · Aug 24, 2003

On Sun, 24 Aug 2003 09:41:55 +0100, Geoff Cox

I know there are 2 mistakes re $into where it should read $intro etc.
I have corrected these but still get the same error message....

Geoff

Geoff Cox · Aug 24, 2003

which is odd...the value for $1 does get into the sub getintro but get
the error message "uninitialized value in pattern match" for the line

if ($into[$n] =~ /$1/) {

have improved code by using strict but still get above error message?!

use strict;

open(IN, "a2-left.htm");
open(OUT, ">>out");
open(INN, "total");

my $line = <IN>;

while ($line ne "") {
if ($line =~ /^<a href/) {
if ($line =~ /="(.*)\.doc/) {
&getintro($1);
}
}
$line = <IN>;
}

sub getintro {
my $n;
my @intro = <INN>;
for ($n=0;$n<900;$n++) {
if ($intro[$n] =~ /$1/) {
print OUT ("$intro[$n]\n");
print OUT ("$intro[$n-1]\n");
}
}
}
close (IN);
close (OUT);
close (INN);

James E Keenan · Aug 24, 2003

Geoff Cox said:
On Sun, 24 Aug 2003 00:05:44 +0100, "Peter Cooper"

Peter et al ...

Now trying this - you will perhaps see better what I am trying to
do...problem with the passing of $1 to the sub getintro - I get an
uninitialized value in pattern match error ...

Cheers

Geoff

open(IN, "a2-left.htm");
open(OUT, ">>out");
open(INN, "total");

if (open(IN, "a2-left.htm")) {

Why are you asking to do something if and only if the filehandle is open?
You opened it 3 lines above.

$line = <IN>;

while ($line ne "") {

better for 2 above lines:

if ($line =~ /^<a href/) {

Right here it becomes apparent that you're trying to parse HTML -- which
means you should heed Peter's advice to check out HTML::Parser.

if ($line =~ /="(.*)\.doc/) {
&getintro($1);
}
}
$line = <IN>;

What's the purpose of the line above?

}
}
sub getintro {

@intro = <INN>;

You don't appear to do anything with the content of @intro, so why read from

for ($n=0;$n<900;$n++) {
if ($into[$n] =~ /$1/) {

.... unless, that is, you have a typo in line above and meant $intro

But here $1 contains the result of the first captured expression on the last
matching line ... which may not always be what you want.

print OUT ("$into[$n]\n");
print OUT ("$line[$n-1]\n");
}
}
}

close (IN);
close (OUT);
close (INN);

Note: The subject of your OP was "backreference problem." But at no point
in the discussion have you used any backreferences (e.g., \1 as part of a
pattern match). This leads me to suspect that you just don't understand
Perl regexes very well. I recommend going to a good Perl text (e.g., the
llama) and carefully working through the exercises on regexes.
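
To make the distinction concrete, here is a small sketch of an actual backreference. The sample sentence is invented for illustration. \1 is used inside the pattern and must match the same text that the first group has just captured, whereas $1 is read afterwards, once the match has succeeded:

```perl
use strict;
use warnings;

my $text = 'this is is a test';    # assumed sample text with a doubled word

my $doubled;
# \1 (a backreference) must repeat whatever (\w+) captured.
if ( $text =~ /\b(\w+) \1\b/ ) {
    $doubled = $1;    # $1 (a capture variable) is read after the match
    print "doubled word: $doubled\n";
}
```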

Tad McClellan · Aug 24, 2003

Have you seen the Posting Guidelines that are posted here frequently?

any ideas?

Indent your code for human readability if you want humans to read it.

Many people will not take the time to read your code because you
did not take the time to make it easy for them to read your code.

open(IN, "a2-left.htm");

You should always, yes *always*, check the return value from open().

You were doing that, but now you've taken it back out.

open(IN, 'a2-left.htm') or die "could not open 'a2-left.htm' $!";

sub getintro {

my $n;

print ("$1\n");

my @intro = <INN>;
for ($n=0;$n<900;$n++) {

foreach my $n ( 0 .. 899 ) { # does the same thing

if ($intro[$n] =~ /$1/) {
&print;
}
}

sub print {
print OUT ("$intro[$n]\n");
                   ^^
                   ^^ $n is undefined

[snip TOFU, please do not do that anymore]

Tad McClellan · Aug 24, 2003

&getintro($1);

Why are you passing an argument when the subroutine definition
never makes use of the argument that you passed?

sub getintro {

my( $file ) = @_;

my $n;

print ("$1\n");

print ("$file\n");

if ($intro[$n] =~ /$1/) {

if ($intro[$n] =~ /$file/) {

&print;

print OUT "$intro[$n]\n"

[snip TOFU]
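
A sketch of the same advice in runnable form. find_lines() is a hypothetical stand-in for getintro, and the sample lines are made up. The first few statements show one likely source of the "uninitialized value in pattern match" warning: any later successful match replaces $1, and a pattern with no capture group leaves it undefined. Copying the argument out of @_ into a lexical avoids that:

```perl
use strict;
use warnings;

# A later successful match replaces $1.
'docs/path/word.doc' =~ /(.*)\.doc/ or die "expected a match";
my $saved = $1;        # copy it straight away
'anything' =~ /any/ or die "expected a match";
my $clobbered = $1;    # undef: the last successful match had no group 1

# Hypothetical replacement for getintro: returns indexes of matching lines.
sub find_lines {
    my ( $file, @lines ) = @_;    # use the argument, not $1
    my @hits;
    for my $n ( 0 .. $#lines ) {
        # \Q...\E quotes the "." and "/" in the path so they match literally.
        push @hits, $n if $lines[$n] =~ /\Q$file\E\.doc/;
    }
    return @hits;
}

my @total = (
    'Intro text for word',
    '<a href="docs/path/word.doc">word</a>',
    'Intro text for word2',
    '<a href="docs/path/word2.doc">word2</a>',
);
my @hits = find_lines( $saved, @total );
print "matching line indexes: @hits\n";
```

Looping over 0 .. $#lines instead of a fixed 0 .. 899 also avoids testing array elements that do not exist.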

Geoff Cox · Aug 24, 2003

On 24 Aug 2003 12:35:34 GMT, "James E Keenan" wrote:

James,

Apologies for calling you John!

Geoff

Geoff Cox · Aug 24, 2003

James,

following code nearly there ... just one major problem ----- I would
like to have the text from the getintro to be in the order in which
the path is obtained from the a2-left.htm file but it is different
here ...From memory I think the problem is that

@intro = <INN>;

is in random order? is there a way round this?

Cheers

Geoff

use strict;

open(IN, "a2-left.htm");
open(OUT, ">>out");
open(INN, "total");

my $line = <IN>;

while ($line ne "") {

if ($line =~ /^<a href/) {

if ($line =~ /="(.*)\.doc/) {
my $found = $1;
&getintro($found);
}

}

$line =<IN>;
}

sub getintro {
my $found;
my $n;

my @intro = <INN>;
for ($n=0;$n<900;$n++) {
if ($intro[$n] =~ /^<a href/) {
if ($intro[$n] =~ /$found/) {
&print;
}
}

}

sub print {

print OUT ("<tr>$intro[$n-1]\n");
print OUT ("$intro[$n]</tr>\n");
}

}

close (IN);
close (OUT);
close (INN);
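
On the ordering question: reading a filehandle in list context returns the lines in file order, not at random. What does change between calls is the read position. The first read runs to end of file, so a second read on the same handle returns nothing. A small sketch using an in-memory file (the data is made up for illustration):

```perl
use strict;
use warnings;

my $data = "line 1\nline 2\nline 3\n";    # assumed stand-in for the "total" file
open my $fh, '<', \$data or die "could not open in-memory file: $!";

my @first  = <$fh>;    # every line, in file order
my @second = <$fh>;    # empty: the handle is already at end of file

seek( $fh, 0, 0 ) or die "seek failed: $!";    # rewind to the start
my @third = <$fh>;     # every line again

close $fh;
printf "%d, %d, %d lines\n", scalar @first, scalar @second, scalar @third;
```

Rewinding with seek, reopening the file, or reading it into an array once outside the sub would each avoid the empty second read.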

Geoff Cox · Aug 24, 2003

Tad,

the code below now does what I want - ie for each path to a Word doc
name in a2-left.htm it finds the same path etc in the file total and
gets the introductory text associated with this doc....

I am sure there are better ways of doing this...any thoughts? The sub
getintro seems poor. By the way, it seems important to open and close
the total file each time the sub getintro is used...

Cheers

Geoff

use strict;

open(IN, "a2-left.htm");
open(OUT, ">>out");

my $line = <IN>;

while ($line ne "") {
if ($line =~ /^<a href/) {
if ($line =~ /href="(.*)\.doc/) {
&getintro($1);
}
}
$line =<IN>;
}

sub getintro {
open (INN, "total");
my $file = $1;
my $n;
my @intro = <INN>;

for ($n=0;$n<900;$n++) {
if ($intro[$n] =~ /$file/i) {
print OUT ("<tr>$intro[$n-1]\n");
print OUT ("$intro[$n]</tr>\n");
}
}
close (INN);
}

close (IN);
close (OUT);
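
One possible tidy-up, offered as a sketch rather than the definitive version. It assumes the lines of both files have already been read into arrays and chomped; intro_rows() and the sample data are hypothetical. The "total" lines are read once and passed in, the capture is copied before any other match runs, and the inner loop stops at the real end of the array instead of a fixed 900:

```perl
use strict;
use warnings;

# Hypothetical helper: takes the link lines and the "total" lines as array refs.
sub intro_rows {
    my ( $links, $total ) = @_;
    my @rows;
    for my $line (@$links) {
        next unless $line =~ /^<a href="(.*?)\.doc"/;
        my $file = $1;    # copy before any other match can replace $1
        for my $n ( 1 .. $#$total ) {    # start at 1 because $n - 1 is used
            next unless $total->[$n] =~ /\Q$file\E\.doc/i;
            push @rows,
                "<tr>" . $total->[ $n - 1 ] . "\n" . $total->[$n] . "</tr>\n";
        }
    }
    return @rows;
}

# Assumed sample data standing in for a2-left.htm and total.
my @links = (
    '<a href="docs/path/word2.doc">link to word doc</a>',
    '<a href="docs/path/word.doc">link to word doc</a>',
);
my @total = (
    'Intro for word',
    '<a href="docs/path/word.doc">word</a>',
    'Intro for word2',
    '<a href="docs/path/word2.doc">word2</a>',
);

my @rows = intro_rows( \@links, \@total );
print @rows;
```

The rows come out in the order of the links, which is the ordering asked for earlier in the thread.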

Geoff Cox · Aug 25, 2003

Jay,

Just to thank you for your comments - I will read them tomorrow...a
little sleep required!

Cheers

Geoff
