a backreference problem?

Discussion in 'Perl Misc' started by Geoff Cox, Aug 23, 2003.

  1. Geoff Cox

    Geoff Cox Guest

    Hello,

    I can use

    $string =~ /="(.*)"\.doc/;
    print $1;

    which will get "docs/path/word"
    from <a href="docs/path/word.doc">link to word doc</a> (A)

    But! What if I have a file with say 100 lines similar to A above? How
    do I deal with multiple values of $1?

    Cheers

    Geoff
     
    Geoff Cox, Aug 23, 2003
    #1

  2. Geoff Cox <> wrote:

    > I can use
    >
    > $string =~ /="(.*)"\.doc/;
    > print $1;



    Yes, but you shouldn't.

    You should never use the dollar-digit variables unless you
    have first ensured that the match _succeeded_.


    if ( $string =~ /="(.*)"\.doc/ )
    { print $1 }


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Aug 23, 2003
    #2

  3. Geoff Cox <> wrote:


    > $string =~ /="(.*)"\.doc/;
                        ^^^
                        ^^^
    > which will get "docs/path/word"
            ^^^^^^^^
            ^^^^^^^^ No it won't.


    > from <a href="docs/path/word.doc">link to word doc</a> (A)



    Your pattern requires a double quote before a dot.

    The string does not contain a double quote before a dot.

    The match must fail, and $1 will *not* be set, it will be left
    with the same value that it had before the match was attempted.
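
    A minimal sketch of that behaviour, assuming the link line from the
    original question. The 'stale value' string and the variable names are
    made up for illustration; the second pattern, with the closing quote
    moved after ".doc", is one possible correction.

```perl
use strict;
use warnings;

my $link = '<a href="docs/path/word.doc">link to word doc</a>';

# A match that succeeds, so $1 now holds "stale".
'stale value' =~ /^(\w+)/;

# The original pattern wants a double quote before ".doc". The string has
# none, so the match fails and $1 keeps its previous value.
my $ok_original  = $link =~ /="(.*)"\.doc/;
my $after_failed = $1;

# With the closing quote after ".doc" the match succeeds and $1 is updated.
my $ok_fixed    = $link =~ /="(.*)\.doc"/;
my $after_fixed = $1;

print "original matched: ", ($ok_original ? "yes" : "no"), ", \$1 = $after_failed\n";
print "fixed matched:    ", ($ok_fixed    ? "yes" : "no"), ", \$1 = $after_fixed\n";
```

    This is why $1 should only be read inside the branch where the match
    succeeded.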


    > But! What if I have a file with say 100 lines similar to A above? How
    > do I deal with multiple values of $1?



    It depends on what "deal with" means when you say it.

    The answer would probably involve one of Perl's looping constructs
    and/or aggregate data types.

    We would need a better question in order to give a better answer.

    If "deal with" means "print dollar one" for instance, then the
    answer would be "use a while(<FILE>) loop".
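
    A sketch of such a loop, under the assumption that "deal with" means
    printing or collecting each path. The in-memory filehandle and the names
    $html, $fh and @paths are stand-ins for the real file, not anything from
    the original post.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the 100-line file: an in-memory filehandle.
my $html = <<'END';
<a href="docs/path/word.doc">link to word doc</a>
<p>no link on this line</p>
<a href="docs/path/word2.doc">link to word doc</a>
END

open(my $fh, '<', \$html) or die "could not open in-memory file: $!";

my @paths;
while (my $line = <$fh>) {
    # $1 is used only inside the branch where the match succeeded.
    if ($line =~ /="(.*?)\.doc"/) {
        push @paths, $1;
        print "$1\n";
    }
}
close($fh);
```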


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Aug 23, 2003
    #3
  4. Peter Cooper

    Peter Cooper Guest

    "Geoff Cox" <> wrote:
    > which will get "docs/path/word"
    > from <a href="docs/path/word.doc">link to word doc</a> (A)
    >
    > But! What if I have a file with say 100 lines similar to A above? How
    > do I deal with multiple values of $1?


    You can match 'many' things into an array like so:

    my $data1 = q{
    <a href="docs/path/word.doc">link to word doc</a>
    <a href="docs/path/word2.doc">link to word doc</a>
    <a href="docs/path/word3.doc">link to word doc</a>
    };

    my @names = ($data1 =~ /="(.*?)\.doc"/gsi);
    print $_ . "\n" for @names;

    However, if you really want to parse HTML, and aren't just using HTML as an
    example here, you will want to look into modules which are dedicated to this
    purpose. Look at the HTML Parser set at
    http://search.cpan.org/author/GAAS/HTML-Parser-3.31/ . HTML::LinkExtor (a
    link extractor) may be of particular use to you.

    Regards,
    Peter Cooper
     
    Peter Cooper, Aug 24, 2003
    #4
  5. Geoff Cox

    Geoff Cox Guest

    On Sun, 24 Aug 2003 00:05:44 +0100, "Peter Cooper"
    <> wrote:

    Peter et al ...

    Now trying this - you will perhaps see better what I am trying to
    do...problem with the passing of $1 to the sub getintro - I get an
    uninitialized value in pattern match error ...

    Cheers

    Geoff

    open(IN, "a2-left.htm");
    open(OUT, ">>out");
    open(INN, "total");

    if (open(IN, "a2-left.htm")) {

    $line = <IN>;

    while ($line ne "") {
    if ($line =~ /^<a href/) {
    if ($line =~ /="(.*)\.doc/) {
    &getintro($1);
    }
    }
    $line = <IN>;

    }
    }
    sub getintro {

    @intro = <INN>;
    for ($n=0;$n<900;$n++) {
    if ($into[$n] =~ /$1/) {
    print OUT ("$into[$n]\n");
    print OUT ("$line[$n-1]\n");
    }
    }
    }

    close (IN);
    close (OUT);
    close (INN);




     
    Geoff Cox, Aug 24, 2003
    #5
  6. Geoff Cox

    Geoff Cox Guest

    On Sun, 24 Aug 2003 09:41:55 +0100, Geoff Cox
    <> wrote:


    I know there are 2 mistakes where $into should read $intro. I have
    corrected these but still get the same error message....

    Geoff

     
    Geoff Cox, Aug 24, 2003
    #6
  7. Geoff Cox

    Geoff Cox Guest

    On Sun, 24 Aug 2003 10:42:09 +0100, Geoff Cox
    <> wrote:


    >which is odd...the value for $1 does get into the sub getintro but get
    >the error message "uninitialized value in pattern match" for the line
    >
    >if ($into[$n] =~ /$1/) {


    have improved code by using strict but still get above error message?!

    use strict;

    open(IN, "a2-left.htm");
    open(OUT, ">>out");
    open(INN, "total");


    my $line = <IN>;

    while ($line ne "") {
    if ($line =~ /^<a href/) {
    if ($line =~ /="(.*)\.doc/) {
    &getintro($1);
    }
    }
    $line = <IN>;
    }

    sub getintro {
    my $n;
    my @intro = <INN>;
    for ($n=0;$n<900;$n++) {
    if ($intro[$n] =~ /$1/) {
    print OUT ("$intro[$n]\n");
    print OUT ("$intro[$n-1]\n");
    }
    }
    }
    close (IN);
    close (OUT);
    close (INN);
     
    Geoff Cox, Aug 24, 2003
    #7
  8. "Geoff Cox" <> wrote in message
    news:...
    > On Sun, 24 Aug 2003 00:05:44 +0100, "Peter Cooper"
    > <> wrote:
    >
    > Peter et al ...
    >
    > Now trying this - you will perhaps see better what I am trying to
    > do...problem with the passing of $1 to the sub getintro - I get an
    > uninitialized value in pattern match error ...
    >
    > Cheers
    >
    > Geoff
    >
    > open(IN, "a2-left.htm");
    > open(OUT, ">>out");
    > open(INN, "total");
    >
    > if (open(IN, "a2-left.htm")) {


    Why are you asking to do something if and only if the filehandle is open?
    You opened it 3 lines above.

    >
    > $line = <IN>;
    >
    > while ($line ne "") {


    better for 2 above lines:

    while (defined($line = <IN>)) {
    next if $line =~ /^$/;

    > if ($line =~ /^<a href/) {


    Right here it becomes apparent that you're trying to parse HTML -- which
    means you should heed Peter's advice to check out HTML::Parser.

    > if ($line =~ /="(.*)\.doc/) {
    > &getintro($1);
    > }
    > }
    > $line = <IN>;
    >

    What's the purpose of the line above?

    > }
    > }
    > sub getintro {
    >
    > @intro = <INN>;


    You don't appear to do anything with the content of @intro, so why read from
    <INN> at all?

    > for ($n=0;$n<900;$n++) {
    > if ($into[$n] =~ /$1/) {


    ... unless, that is, you have a typo in line above and meant $intro

    But here $1 contains the result of the first captured expression on the last
    matching line ... which may not always be what you want.

    > print OUT ("$into[$n]\n");
    > print OUT ("$line[$n-1]\n");
    > }
    > }
    > }
    >
    > close (IN);
    > close (OUT);
    > close (INN);
    >


    Note: The subject of your OP was "backreference problem." But at no point
    in the discussion have you used any backreferences (e.g., \1 as part of a
    pattern match). This leads me to suspect that you just don't understand
    Perl regexes very well. I recommend going to a good Perl text (e.g., the
    llama) and carefully working through the exercises on regexes.
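
    To illustrate the distinction with a small sketch (the sample text and
    variable names are invented here): \1 is used inside the pattern and
    means "the same text group 1 just matched", while $1 is read after the
    match and holds what group 1 captured.

```perl
use strict;
use warnings;

my $text = 'this is is a test';

# \1 inside the pattern: match a word followed by a space and the same word.
my $doubled;
if ($text =~ /\b(\w+) \1\b/) {
    # $1 outside the pattern: the text that group 1 captured.
    $doubled = $1;
    print "doubled word: $doubled\n";
}
```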
     
    James E Keenan, Aug 24, 2003
    #8
  9. Geoff Cox <> wrote:


    Have you seen the Posting Guidelines that are posted here frequently?


    > any ideas?



    Indent your code for human readability if you want humans to read it.

    Many people will not take the time to read your code because you
    did not take the time to make it easy for them to read your code.


    > open(IN, "a2-left.htm");



    You should always, yes *always*, check the return value from open().

    You were doing that, but now you've taken it back out.

    open(IN, 'a2-left.htm') or die "could not open 'a2-left.htm' $!";


    > sub getintro {
    >
    > my $n;
    >
    > print ("$1\n");
    >
    > my @intro = <INN>;
    > for ($n=0;$n<900;$n++) {



    foreach my $n ( 0 .. 899 ) { # does the same thing


    > if ($intro[$n] =~ /$1/) {
    > &print;
    > }
    > }
    >
    > sub print {
    > print OUT ("$intro[$n]\n");
                         ^^
                         ^^ $n is undefined



    [snip TOFU, please do not do that anymore]

    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Aug 24, 2003
    #9
  10. Geoff Cox <> wrote:


    > &getintro($1);



    Why are you passing an argument when the subroutine definition
    never makes use of the argument that you passed?


    > sub getintro {



    my( $file ) = @_;


    > my $n;
    >
    > print ("$1\n");



    print ("$file\n");


    > if ($intro[$n] =~ /$1/) {



    if ($intro[$n] =~ /$file/) {


    > &print;



    print OUT "$intro[$n]\n"
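
    Put together, the changes above might look like the sketch below. The
    sample @intro data, passing the lines as a second argument, the loop
    over existing indices instead of 0 .. 899, and the \Q...\E quoting are
    all additions made for illustration and are not part of the quoted code.
    Looping past the end of the array is one likely source of an
    "uninitialized value in pattern match" warning.

```perl
use strict;
use warnings;

# Hypothetical contents of the "total" file, one element per line.
my @intro = (
    "Intro text for word\n",
    qq{<a href="docs/path/word.doc">word</a>\n},
    "Intro text for word2\n",
    qq{<a href="docs/path/word2.doc">word2</a>\n},
);

sub getintro {
    my ($file, $lines) = @_;    # arguments arrive in @_

    my @found;
    # Visit only the indices that exist, so no undefined element is matched.
    for my $n (0 .. $#$lines) {
        # \Q...\E makes the path match literally ("." is not a wildcard).
        if ($lines->[$n] =~ /\Q$file\E\.doc/) {
            push @found, $lines->[$n];
        }
    }
    return @found;
}

my @hits = getintro('docs/path/word', \@intro);
print @hits;
```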



    [snip TOFU]

    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Aug 24, 2003
    #10
  11. Geoff Cox

    Geoff Cox Guest

    On 24 Aug 2003 12:35:34 GMT, "James E Keenan" <>
    wrote:

    James,

    Apologies for calling you John!

    Geoff

     
    Geoff Cox, Aug 24, 2003
    #11
  12. Geoff Cox

    Geoff Cox Guest

    James,

    following code nearly there ... just one major problem ----- I would
    like to have the text from the getintro to be in the order in which
    the path is obtained from the a2-left.htm file but it is different
    here ...From memory I think the problem is that

    @intro = <INN>;

    is in random order? is there a way round this?

    Cheers

    Geoff



    use strict;

    open(IN, "a2-left.htm");
    open(OUT, ">>out");
    open(INN, "total");

    my $line = <IN>;

    while ($line ne "") {

    if ($line =~ /^<a href/) {

    if ($line =~ /="(.*)\.doc/) {
    my $found = $1;
    &getintro($found);
    }

    }

    $line =<IN>;
    }



    sub getintro {
    my $found;
    my $n;

    my @intro = <INN>;
    for ($n=0;$n<900;$n++) {
    if ($intro[$n] =~ /^<a href/) {
    if ($intro[$n] =~ /$found/) {
    &print;
    }
    }

    }

    sub print {

    print OUT ("<tr>$intro[$n-1]\n");
    print OUT ("$intro[$n]</tr>\n");
    }

    }


    close (IN);
    close (OUT);
    close (INN);
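
    On the ordering question, a small sketch (the in-memory "file" and all
    names are stand-ins): reading a filehandle in list context returns the
    lines in file order, and a second read from the same handle returns
    nothing because it is already at end of file. Separately, `my $found;`
    inside getintro declares a new, undefined variable; the value passed in
    arrives in @_.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the "total" file.
my $total = "first\nsecond\nthird\n";
open(my $inn, '<', \$total) or die "could not open in-memory file: $!";

my @first_read  = <$inn>;    # every line, in file order
my @second_read = <$inn>;    # handle is at end of file: empty list

print "first read:  ", scalar(@first_read),  " lines\n";
print "second read: ", scalar(@second_read), " lines\n";
close($inn);
```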
     
    Geoff Cox, Aug 24, 2003
    #12
  13. Geoff Cox

    Geoff Cox Guest

    On Sun, 24 Aug 2003 09:58:19 -0500, (Tad
    McClellan) wrote:

    Tad,

    the code below now does what I want - ie for each path to a Word doc
    name in a2-left.htm it finds the same path etc in the file total and
    gets the introductory text associated with this doc....

    I am sure there are better ways of doing this...any thoughts? The sub
    getintro seems poor. By the way, it seems important to open and close
    the total file each time the sub getintro is used...

    Cheers

    Geoff

    use strict;

    open(IN, "a2-left.htm");
    open(OUT, ">>out");

    my $line = <IN>;

    while ($line ne "") {
    if ($line =~ /^<a href/) {
    if ($line =~ /href="(.*)\.doc/) {
    &getintro($1);
    }
    }
    $line =<IN>;
    }


    sub getintro {
    open (INN, "total");
    my $file = $1;
    my $n;
    my @intro = <INN>;

    for ($n=0;$n<900;$n++) {
    if ($intro[$n] =~ /$file/i) {
    print OUT ("<tr>$intro[$n-1]\n");
    print OUT ("$intro[$n]</tr>\n");
    }
    }
    close (INN);
    }


    close (IN);
    close (OUT);
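
    One possible tidier arrangement, offered as a sketch rather than the
    definitive way to do it: read the total file once into an array and pass
    that array to getintro, so the file need not be reopened on every call.
    The sample data, the in-memory filehandles and the choice to return rows
    instead of printing to OUT are assumptions made so the example runs on
    its own.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for a2-left.htm and total.
my $left = <<'END';
<a href="docs/b.doc">B</a>
<a href="docs/a.doc">A</a>
END
my $total_text = <<'END';
<td>Intro for A</td>
<td><a href="docs/a.doc">A</a></td>
<td>Intro for B</td>
<td><a href="docs/b.doc">B</a></td>
END

open(my $inn, '<', \$total_text) or die "could not open total: $!";
chomp(my @intro = <$inn>);          # read once, reuse for every lookup
close($inn);

sub getintro {
    my ($file, $lines) = @_;
    my @rows;
    for my $n (1 .. $#$lines) {     # start at 1 so $n - 1 is always valid
        next unless $lines->[$n] =~ /\Q$file\E\.doc/i;
        push @rows, "<tr>$lines->[$n-1]\n$lines->[$n]</tr>\n";
    }
    return @rows;
}

my @out;
open(my $in, '<', \$left) or die "could not open a2-left.htm: $!";
while (my $line = <$in>) {
    # Rows come out in the order the links appear in a2-left.htm.
    push @out, getintro($1, \@intro) if $line =~ /^<a href="(.*?)\.doc"/;
}
close($in);
print @out;
```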





     
    Geoff Cox, Aug 24, 2003
    #13
  14. Geoff Cox

    Geoff Cox Guest

    On Sun, 24 Aug 2003 18:55:34 GMT, (Jay Tilton)
    wrote:

    Jay,

    Just to thank you for your comments - I will read them tomorrow...a
    little sleep required!

    Cheers

    Geoff
     
    Geoff Cox, Aug 25, 2003
    #14
