regex problem

Discussion in 'Perl Misc' started by sstark, Jun 12, 2009.

  1. sstark

    sstark Guest

    Hi all, I know I will be told that I should be using one of the HTML
    or XML parsers to perform this task, and you're right! :) but still
    I'd like to know why my little regex while() loop isn't working.

    Here's a sample XHTML code snippet that I'm working on:

    Code:
    <pre>
    <li>description <a href="overview_mh.html#overview">(1)</a>, <a
    href="catalog.html#catalog">(2)</a>
    </pre>
    
    My goal:
    
    1. Read a line of an XHTML file.
    2. If it contains an href= attribute, grab everything up to the
    opening href quote and print it out to a file.
    3. Read the href attribute itself, change it and print the changed
    version out to the file.
    4. Delete everything from the beginning of the line up to the closing
    quote of the href attribute.
    5. Check to see if there's another href on the line; if so, repeat. If
    not, print out the rest to the file.
    
    Here's a code snippet:
    
    <pre>
    while($line =~ /^(.*?href\s*=\s*\")([^\"]+)(\".*)/i){ #"
        my $prev =$1;
        my $href =$2;
        print NEW $prev;
        # do some stuff to the href
        # ...
        print NEW $href;
        # remove both $prev and $href from $line and continue
        print "VALUE OF prev: $prev\n";
        print " BEFORE: $line";
        $line =~ s/^$prev//;
        print " AFTER: $line";
        $line =~ s/^$href//;
    }
    print $line;
    </pre>
    For some reason, the s/^$prev//; isn't working; the result looks
    something like this:

    VALUE OF prev: ">(1)</a>, <a href="
    BEFORE: ">(1)</a>, <a href="catalog.html#catalog">(2)</a>
    AFTER: ">(1)</a>, <a href="catalog.html#catalog">(2)</a>

    Why isn't it deleting the value of $prev in $line?

    Meanwhile, yes I'm looking at one of the parsers.

    thanks,
    Scott
     
    sstark, Jun 12, 2009
    #1

  2. sstark

    Jim Gibson Guest

    Ben Morrow wrote:

    > Quoth sstark:
    > <snip>
    > > For some reason, the s/^$prev//; isn't working; The result looks
    > > something like this:
    > >
    > > VALUE OF prev: ">(1)</a>, <a href="
    > > BEFORE: ">(1)</a>, <a href="catalog.html#catalog">(2)</a>
    > > AFTER: ">(1)</a>, <a href="catalog.html#catalog">(2)</a>
    > >
    > > Why isn't it deleting the value of $prev in $line?
    >
    > $prev contains regex metacharacters, in this case the parens, which
    > don't match literally. As on any occasion when you want to match a
    > string literally, you need \Q:
    >
    > $line =~ s/^\Q$prev//;


    Or avoid problems with regexes altogether by using substr:

    substr($line,0,length($prev),'');
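
    To make the difference concrete, here is a minimal sketch comparing the three approaches side by side. The values of $prev and $line are copied from the debug output in the original post; the variable names $unquoted, $quoted and $cut are only for illustration.

```perl
use strict;
use warnings;

# Values taken from the debug output earlier in the thread.
my $prev = '">(1)</a>, <a href="';
my $line = $prev . 'catalog.html#catalog">(2)</a>';

# Unquoted: "(1)" is read as a capture group matching just "1", so the
# literal parentheses in $line never match and nothing is removed.
(my $unquoted = $line) =~ s/^$prev//;

# \Q...\E (equivalent to quotemeta) escapes the metacharacters first.
(my $quoted = $line) =~ s/^\Q$prev\E//;

# Four-argument substr removes the prefix without any regex at all.
my $cut = $line;
substr($cut, 0, length($prev), '');

print "unquoted: $unquoted\n";
print "quoted:   $quoted\n";
print "substr:   $cut\n";
```

    Only the unquoted version leaves the line unchanged; the other two both strip the prefix, and substr does so without compiling a pattern.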

    --
    Jim Gibson
     
    Jim Gibson, Jun 12, 2009
    #2

  3. sstark

    sstark Guest

    Thanks Ben and Jim, just the answers I needed.

    Scott
     
    sstark, Jun 13, 2009
    #3
  4. sstark

    Guest

    On Fri, 12 Jun 2009 12:20:05 -0700 (PDT), sstark wrote:

    > <snip>
    >
    >   while($line =~ /^(.*?href\s*=\s*\")([^\"]+)(\".*)/i){ #"
                                             ^     ^     ^
    There's no need to escape these double quotes, and you probably
    need /is at the end. There also doesn't seem to be a need for
    capture #3, or the .* in its group, since it looks like you're
    buffering with $line .= <DATA> somewhere. If so, the line will
    always contain the remainder after the substitution, and pos()
    will always be reset. I could be wrong.
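
    As a small illustration of why /s matters here, the sketch below uses an anchor tag split across two lines, like the sample in the original post. The variable names ($text, $without_s) are made up for the example.

```perl
use strict;
use warnings;

# The anchor tag is split across two lines, as in the original sample.
my $text = qq{<li>description <a\nhref="catalog.html#catalog">(2)</a>\n};

# Without /s, "." cannot cross the newline, so the anchored match fails.
my $without_s = $text =~ /^(.*?href\s*=\s*")([^"]+)"/i ? 1 : 0;

# With /s, "." also matches "\n" and the href value is captured.
my ($prev, $href) = $text =~ /^(.*?href\s*=\s*")([^"]+)"/is;

print "matched without /s: $without_s\n";
print "href with /s: $href\n";
```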
    > <snip>


    Somebody already mentioned that the parentheses are treated as
    metacharacters in the substitution pattern, so that's your problem.

    However, it might be less expensive to slurp in the whole file and
    avoid the substitution altogether. The substitution is an easy
    buffering mechanism, albeit an expensive one in time.

    -sln

    -------------
    use strict;
    use warnings;

    # /(?:(.*?href\s*=\s*)(["'])(.*?)(\2))|(.*)/isg

    my $line = join ( '', <DATA>);
    my $count = 1;

    while( $line =~ /
        (?:
            (.*?href\s*=\s*)   # $1 all up to the next and including href=
            (["'])             # $2 " or ' (1+2 = prev)
            (.*?)              # $3 attribute value (unquoted)
            (\2)               # $4 what $2 is (" or ')
        )
        |                      # or
        (.*)                   # $5 the remainder (only hit once)
        /xisg)
    {
        print "Pass ".$count++.":\n------------\n";
        if (defined $1) {
            print "prev:\n".$1.$2."\n";
            print "val:\n".$3."\n";
            print "end:\n".$4."\n";
        }
        if (defined $5) {
            print "final:\n".$5."\n";
        }
    }
    __DATA__
    <pre>
    <li>description <a href="overview_mh.html#overview">(1)</a>,
    <a href="catalog.html#catalog">(2)</a>
    </pre>
     
    , Jun 13, 2009
    #4
  5. Charlton Wilbur

    >>>>> "s" == sstark writes:

    s> Why isn't it deleting the value of $prev in $line?

    Because ^ doesn't do what you think it does, and it only works in
    the code you have there out of pure luck and coincidence.

    Charlton


    --
    Charlton Wilbur
     
    Charlton Wilbur, Jun 13, 2009
    #5
  6. sstark

    Guest

    On Sat, 13 Jun 2009 02:11:44 -0700, wrote:

    > <snip>

    Better written as:

    use strict;
    use warnings;

    # /(?:(.*?href\s*=\s*)(["'])(.*?)\2)|(.+)/isg

    my $line = join ( '', <DATA>);
    my $count = 1;

    while( $line =~ /
        (?:
            (.*?href\s*=\s*)   # $1 all up to the next and including href=
            (["'])             # $2 " or ' (1+2 = prev)
            (.*?)              # $3 attribute value (unquoted)
            \2                 # what $2 is (" or ')
        )
        |                      # or
        (.+)                   # $4 the remainder (no more href)
        /xisg)
    {
        print "Pass ".$count++.":\n------------\n";
        if (defined $1) {
            print "prev:\n".$1.$2."\n"; # previous plus quote, print to file
            print "val:\n".$3."\n"; # unquoted href value, modify & print to file
            print "end:\n".$2."\n"; # closing quote, print to file
        }
        if (defined $4) {
            print "final:\n".$4."\n"; # remainder, no hrefs, print to file
        }
    }


    __DATA__
    <pre>
    <li>description <a href="overview_mh.html#overview">(1)</a>,
    <a href="catalog.html#catalog">(2)</a>
    </pre>
     
    , Jun 14, 2009
    #6
  7. Charlton Wilbur

    >>>>> "BM" == Ben Morrow writes:

    BM> Quoth Charlton Wilbur:
    >> >>>>> "s" == sstark writes:


    BM> [ sstark's code was

    BM> $line =~ s/^$prev//;

    BM> ]

    s> Why isn't it deleting the value of $prev in $line?

    >> Because ^ doesn't do what you think it does, and it only works in
    >> the code you have there out of pure luck and coincidence.


    BM> Please explain further. ^ means 'match at the beginning of the
    BM> string', unless /m is given, in which case it means 'match at
    BM> the beginning of any line'. How is this not what the OP thought
    BM> it meant?

    Because the strings he wants are at the start of the strings he's
    looking at purely by accident -- it's not part of his specification.
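
    For reference, the anchor semantics Ben describes can be checked with a tiny sketch. The string and variable names here are invented for the illustration and are not from the original code.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, ^ matches only at the very start of the string.
my $plain = $text =~ /^second/ ? 1 : 0;

# With /m, ^ also matches just after each embedded newline.
my $multi = $text =~ /^second/m ? 1 : 0;

print "without /m: $plain, with /m: $multi\n";
```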

    Charlton

    --
    Charlton Wilbur
     
    Charlton Wilbur, Jun 14, 2009
    #7
  8. sstark

    Guest

    On Sun, 14 Jun 2009 15:26:16 -0400, Charlton Wilbur wrote:

    > <snip>
    >
    >Because the strings he wants are at the start of the strings he's
    >looking at purely by accident -- it's not part of his specification.


    I don't think it's by coincidence.
    The caret ^ in this:

    /^(.*?href\s*=\s*\")([^\"]+)(\".*)/i

    is not actually needed, since .*? will only grab the first instance of the
    matching pattern, that is, everything from the beginning of the line.
    However, he needs /is. So the basic principle is sound, and it doesn't
    matter whether the ^ is there or not. There is no global modifier anyway,
    so the pos() of the match is not remembered in the while loop, and the
    search restarts at position 0 each time.
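
    A minimal sketch of that pos() behaviour, using a made-up string with two href attributes (none of these names come from the original code):

```perl
use strict;
use warnings;

my $text = 'href="a.html" href="b.html"';

# Without /g a scalar-context match always starts at offset 0, so it
# finds the first href every time and pos() is never set.
my @no_g;
for (1 .. 2) {
    push @no_g, $1 if $text =~ /href="([^"]+)"/;
}
my $pos_no_g = defined pos($text) ? pos($text) : 'undef';

# With /g in scalar context, each match resumes where the last one ended.
my @with_g;
while ($text =~ /href="([^"]+)"/g) {
    push @with_g, $1;
}

print "no /g:   @no_g (pos: $pos_no_g)\n";
print "with /g: @with_g\n";
```

    This is why the original loop only makes progress if $line itself is shortened on every pass.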

    On top of that, the contents of (.*?href\s*=\s*\") are used as a pattern
    in a later substitution regex (where he had the metacharacter quoting
    problem). The line is still the same until he does the substitution, but
    he needs to quote metacharacters when using the capture from the while
    regex as a pattern in the substitution further down. The substitution
    finds exactly the same text, because the line hasn't changed and that
    text is the first thing it would match.

    The fact that he does this in a while() statement is misleading until you
    read it a little more closely.

    What he is in fact trying to do is a cheap buffering method that appends
    file stream data, line by line, to the buffer ($line) after the
    substitution, thus avoiding the /g modifier and a match-only
    (non-substituting) approach.

    His code doesn't show it, but it would probably have to do something like
    the following for it to work.

    while (defined ($buff = <DATA>))
    {
        $line .= $buff;
        # original: while($line =~ /^(.*?href\s*=\s*\")([^\"]+)(\".*)/i){
        while($line =~ /^(.*?href\s*=\s*")([^"]+)"/is) # fixed up
        {
            my $prev =$1;
            my $href =$2;
            $prev = quotemeta $prev; # the fix
            $href = quotemeta $href; # the fix
            $line =~ s/^$prev//;
            $line =~ s/^$href//;
        }
    }
    # see what's left in $line

    In reality the ^ shouldn't be needed, but it doesn't seem to hurt.

    This method is pretty slow, but it's cheap buffering. The alternative is
    something like the code below.

    -sln

    -------------
    use strict;
    use warnings;

    # /(?:(.*?href\s*=\s*)(["'])(.*?)\2)|(.+)/isg

    my ($line,$buff) = ('', '');
    my $count = 1;

    while (defined ($_ = <DATA>))
    {
        $line = $buff.$_;
        $buff = ''; # buffer consumed; refilled below if a remainder is left
        while( $line =~ /(?:(.*?href\s*=\s*)(["'])(.*?)\2)|(.+)/isg )
        {
            if (defined $1) {
                print "Pass ".$count++.":\n------------\n";
                print "prev:\n".$1.$2."\n"; # previous, print to file
                print "val:\n".$3."\n"; # href, modify & print to file
                print "end:\n".$2."\n"; # closing quote, print to file
            }
            elsif (defined $4) {
                $buff = $4; # remainder, buffer it
            }
        }
    }
    if (length $buff) {
        print "Pass ".$count++.":\n------------\n";
        print "buff:\n".$buff."\n"; # remaining buffer, print to file
    }

    __DATA__
    <pre>
    some junk content
    <li>description <a href
    ="overview_mh.html#overview">(1)</a>,
    <a href = 'catalog.html#catalog'>(2)</a>
    </pre>
     
    , Jun 15, 2009
    #8