regex problem

sstark

Hi all, I know I will be told that I should be using one of the HTML
or XML parsers to perform this task, and you're right! :) but still
I'd like to know why my little regex while() loop isn't working.

Here's a sample XHTML code snippet that I'm working on:

Code:
<pre>
<li>description <a href="overview_mh.html#overview">(1)</a>, <a
href="catalog.html#catalog">(2)</a>
</pre>

My goal:

1. Read a line of an XHTML file.
2. If it contains an href= attribute, grab everything up to the
opening href quote and print it out to a file.
3. Read the href attribute itself, change it and print the changed
version out to the file.
4. Delete everything from the beginning of the line up to the closing
quote of the href attribute.
5. Check to see if there's another href on the line; if so, repeat. If
not, print out the rest to the file.

Here's a code snippet:

<pre>
while($line =~ /^(.*?href\s*=\s*\")([^\"]+)(\".*)/i){	#"
    my $prev = $1;
    my $href = $2;
    print NEW $prev;
    # do some stuff to the href
    # ...
    print NEW $href;
    # remove both $prev and $href from $line and continue
    print "VALUE OF prev: $prev\n";
    print " BEFORE: $line";
    $line =~ s/^$prev//;
    print " AFTER: $line";
    $line =~ s/^$href//;
}
print $line;
</pre>

For some reason, the s/^$prev//; isn't working. The result looks
something like this:

VALUE OF prev: ">(1)</a>, <a href="
BEFORE: ">(1)</a>, <a href="catalog.html#catalog">(2)</a>
AFTER: ">(1)</a>, <a href="catalog.html#catalog">(2)</a>

Why isn't it deleting the value of $prev in $line?

Meanwhile, yes I'm looking at one of the parsers.

thanks,
Scott
 
Jim Gibson

Ben Morrow said:
Quoth sstark <[email protected]>:
<snip>


$prev contains regex metacharacters, in this case the parens, which
don't match literally. As on any occasion when you want to match a
string literally, you need \Q:

$line =~ s/^\Q$prev//;

Or avoid problems with regexes altogether by using substr:

substr($line,0,length($prev),'');
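
To make the failure concrete, here is a minimal sketch that replays the second pass of the loop, using the values from the debug output in the original post. The names $plain, $quoted and $chopped are only for illustration.

```perl
use strict;
use warnings;

# Values taken from the second pass of the loop in the original post.
my $line = qq{">(1)</a>, <a href="catalog.html#catalog">(2)</a>\n};
my $prev = q{">(1)</a>, <a href="};

# Unquoted: "(1)" is a capture group that matches "1", so the pattern
# looks for '">1</a>' and the substitution changes nothing.
(my $plain = $line) =~ s/^$prev//;

# \Q escapes the metacharacters, so "(1)" is matched literally.
(my $quoted = $line) =~ s/^\Q$prev//;

# substr needs no pattern at all: cut length($prev) characters off the front.
my $chopped = $line;
substr($chopped, 0, length($prev), '');

print "plain:   $plain";
print "quoted:  $quoted";
print "chopped: $chopped";
```

The substr form only works because $prev is known to sit at the very start of $line, which is the case inside this loop.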
 
sln

sstark said:
<snip>
while($line =~ /^(.*?href\s*=\s*\")([^\"]+)(\".*)/i){	#"

You don't need to escape the double quotes in this pattern, and you
probably need /is at the end. There also doesn't seem to be a need for
capture #3, or for the .* in that group, since it looks like you're
buffering with $line .= <DATA> somewhere. If so, the line will always
contain the remainder after the substitution, and pos() will always be
reset. I could be wrong.
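
As a rough illustration of the /s point, assume a buffer that holds more than one input line. The $buf contents below are made up from the sample XHTML, and $without_s and $with_s are names chosen just for this sketch.

```perl
use strict;
use warnings;

# Hypothetical buffer: two input lines joined, with the tag split across them.
my $buf = qq{<li>description <a\nhref="catalog.html#catalog">(2)</a>\n};

# Without /s, "." stops at the newline, so the anchored .*? can never
# reach the href on the second line.
my $without_s = $buf =~ /^(.*?href\s*=\s*")([^"]+)"/i  ? $2 : 'no match';

# With /s, "." also matches newlines.
my $with_s    = $buf =~ /^(.*?href\s*=\s*")([^"]+)"/is ? $2 : 'no match';

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```

If each $line really holds only one physical line, /s makes no difference; it matters as soon as lines are appended to a buffer.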

Somebody already mentioned that the parentheses in $prev act as
metacharacters when it is used as the pattern in the substitution.
So that's your problem.

However, it might be less expensive to slurp in the whole file and
avoid the substitution altogether. The substitution is an easy
buffering mechanism, albeit an expensive one in terms of time.

-sln

-------------
use strict;
use warnings;

# /(?:(.*?href\s*=\s*)(["'])(.*?)(\2))|(.*)/isg

my $line = join ( '', <DATA>);
my $count = 1;

while( $line =~ /
    (?:
        (.*?href\s*=\s*)  # $1 all up to and including the next href=
        (["'])            # $2 " or ' (1+2 = prev)
        (.*?)             # $3 attribute value (unquoted)
        (\2)              # $4 what $2 is (" or ')
    )
  |                       # or
    (.*)                  # $5 the remainder (also matches empty at the very end)
  /xisg)
{
    print "Pass ".$count++.":\n------------\n";
    if (defined $1) {
        print "prev:\n".$1.$2."\n";
        print "val:\n".$3."\n";
        print "end:\n".$4."\n";
    }
    if (defined $5) {
        print "final:\n".$5."\n";
    }
}
__DATA__
<pre>
<li>description <a href="overview_mh.html#overview">(1)</a>,
<a href="catalog.html#catalog">(2)</a>
</pre>
 
Charlton Wilbur

s> Why isn't it deleting the value of $prev in $line?

Because ^ doesn't do what you think it does, and it only works in
the code you have there out of pure luck and coincidence.

Charlton
 
sln

Better written as:

use strict;
use warnings;

# /(?:(.*?href\s*=\s*)(["'])(.*?)\2)|(.+)/isg

my $line = join ( '', <DATA>);
my $count = 1;

while( $line =~ /
    (?:
        (.*?href\s*=\s*)  # $1 all up to and including the next href=
        (["'])            # $2 " or ' (1+2 = prev)
        (.*?)             # $3 attribute value (unquoted)
        \2                # what $2 is (" or ')
    )
  |                       # or
    (.+)                  # $4 the remainder (no more href)
  /xisg)
{
    print "Pass ".$count++.":\n------------\n";
    if (defined $1) {
        print "prev:\n".$1.$2."\n";   # previous plus quote, print to file
        print "val:\n".$3."\n";       # unquoted href value, modify & print to file
        print "end:\n".$2."\n";       # closing quote, print to file
    }
    if (defined $4) {
        print "final:\n".$4."\n";     # remainder, no hrefs, print to file
    }
}

__DATA__
<pre>
<li>description <a href="overview_mh.html#overview">(1)</a>,
<a href="catalog.html#catalog">(2)</a>
</pre>
 
Charlton Wilbur

BM> [ sstark's code was

BM> $line =~ s/^$prev//;

BM> ]

s> Why isn't it deleting the value of $prev in $line?

BM> Please explain further. ^ means 'match at the beginning of the
BM> string', unless /m is given, in which case it means 'match at
BM> the beginning of any line'. How is this not what the OP thought
BM> it meant?

Because the strings he wants are at the start of the strings he's
looking at purely by accident -- it's not part of his specification.

Charlton
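
For reference, the behaviour of ^ described in the quoted text can be checked with a small sketch. The $text value is made-up sample data, and @plain and @multi are illustrative names.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, ^ matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, ^ also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

print "without /m: @plain\n";
print "with /m:    @multi\n";
```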
 
sln

I don't think it's by coincidence.
The caret ^ in this:

/^(.*?href\s*=\s*\")([^\"]+)(\".*)/i

is not actually needed, since .*? will only grab the first instance of the
matching pattern, taking everything from the beginning of the line.
He does need /is, however. So the basic principle is sound, and it doesn't
matter whether the ^ is there or not. There is no global modifier anyway,
so the pos() of the match is not remembered in the while loop, and the
search restarts at position 0 each time.
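
A small sketch of that difference, using a made-up $line with two links (@no_g and @with_g are illustrative names):

```perl
use strict;
use warnings;

my $line = q{<a href="a.html">x</a> <a href="b.html">y</a>};

# Scalar-context match without /g: every attempt starts at offset 0,
# so repeating it finds the first href again.
my @no_g;
for (1 .. 2) {
    push @no_g, $1 if $line =~ /href="([^"]+)"/;
}

# With /g, pos($line) records where the last match ended, and the next
# attempt resumes from there.
my @with_g;
while ($line =~ /href="([^"]+)"/g) {
    push @with_g, $1 . '@' . pos($line);
}

print "no /g: @no_g\n";
print "/g:    @with_g\n";
```

This is why the original loop has to shorten $line itself on every pass: without /g nothing else moves the search forward.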

On top of that, the contents of (.*?href\s*=\s*\") are used as the pattern
in a later substitution regex, and that is where the metacharacter quoting
problem comes in. The line is still the same until he does the substitution,
so the substitution would find exactly the same text at the start of the
line, but only if the metacharacters in the captured string are quoted
before it is used as a pattern.

The fact that he does this in a while() statement is misleading until you
read it a little more closely.

What he is in fact trying to do is a cheap buffering method: append the file
stream data to the buffer ($line), line by line, after the substitution, thus
avoiding the global modifier /g and a non-substituting match.

His code doesn't show it, but it would probably have to do something like the
code below for it to work.

my $line = '';
my $buff;
while (defined ($buff = <DATA>))
{
    $line .= $buff;
    # original: while($line =~ /^(.*?href\s*=\s*\")([^\"]+)(\".*)/i){
    while ($line =~ /^(.*?href\s*=\s*")([^"]+)"/is) # fixed up
    {
        my $prev = $1;
        my $href = $2;
        $prev = quotemeta $prev; # the fix
        $href = quotemeta $href; # the fix
        $line =~ s/^$prev//;
        $line =~ s/^$href//;
    }
}
# see what's left in $line

In reality the ^ shouldn't be needed, but it doesn't seem to hurt.

This method is pretty slow, but it's cheap buffering. The alternative is
something like the code below.

-sln

-------------
use strict;
use warnings;

# /(?:(.*?href\s*=\s*)(["'])(.*?)\2)|(.+)/isg

my ($line, $buff) = ('', '');   # was (''.''), which left $buff undefined
my $count = 1;

while (defined ($_ = <DATA>))
{
    $line = $buff.$_;
    $buff = '';   # now part of $line; avoids re-using a stale remainder
    while( $line =~ /(?:(.*?href\s*=\s*)(["'])(.*?)\2)|(.+)/isg )
    {
        if (defined $1) {
            print "Pass ".$count++.":\n------------\n";
            print "prev:\n".$1.$2."\n";   # previous, print to file
            print "val:\n".$3."\n";       # href, modify & print to file
            print "end:\n".$2."\n";       # closing quote, print to file
        }
        elsif (defined $4) {
            $buff = $4;                   # remainder, buffer it
        }
    }
}
if (length $buff) {
    print "Pass ".$count++.":\n------------\n";
    print "buff:\n".$buff."\n";           # remaining buffer, print to file
}

__DATA__
<pre>
some junk content
<li>description <a href
="overview_mh.html#overview">(1)</a>,
<a href = 'catalog.html#catalog'>(2)</a>
</pre>
 
