regex problem

sstark · Jun 12, 2009

Hi all, I know I will be told that I should be using one of the HTML
or XML parsers to perform this task, and you're right!

But still, I'd like to know why my little regex while() loop isn't working.

Here's a sample XHTML code snippet that I'm working on:

Code:

<pre>
<li>description <a href="overview_mh.html#overview">(1)</a>, <a
href="catalog.html#catalog">(2)</a>
</pre>

My goal:

1. Read a line of an XHTML file.
2. If it contains an href= attribute, grab everything up to the
opening href quote and print it out to a file.
3. Read the href attribute itself, change it and print the changed
version out to the file.
4. Delete everything from the beginning of the line up to the closing
quote of the href attribute.
5. Check to see if there's another href on the line; if so, repeat. If
not, print out the rest to the file.

Here's a code snippet:

<pre>
while($line =~ /^(.*?href\s*=\s*\")([^\"]+)(\".*)/i){	#"
my $prev =$1;
my $href =$2;
print NEW $prev;
# do some stuff to the href
# ...
print NEW $href;
# remove both $prev and $href from $line and continue
print "VALUE OF prev: $prev\n";
print " BEFORE: $line";
$line =~ s/^$prev//;
print " AFTER: $line";
$line =~ s/^$href//;
}
print $line;
</pre>

For some reason, the s/^$prev//; isn't working. The result looks
something like this:

VALUE OF prev: ">(1)</a>, <a href="
BEFORE: ">(1)</a>, <a href="catalog.html#catalog">(2)</a>
AFTER: ">(1)</a>, <a href="catalog.html#catalog">(2)</a>

Why isn't it deleting the value of $prev in $line?

Meanwhile, yes I'm looking at one of the parsers.

thanks,
Scott

Jim Gibson · Jun 12, 2009

Ben Morrow said:
Quoth sstark:
<snip>

$prev contains regex metacharacters, in this case the parens, which
don't match literally. As on any occasion when you want to match a
string literally, you need \Q:

$line =~ s/^\Q$prev//;
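
A minimal sketch of the difference, using the values from the debug output in the original post. The names $copy_raw and $copy_quoted are made up for this illustration.

```perl
use strict;
use warnings;

# Values taken from the debug output in the original post.
my $prev = '">(1)</a>, <a href="';
my $line = '">(1)</a>, <a href="catalog.html#catalog">(2)</a>';

# Interpolated as a pattern, "(1)" is a capture group that matches a bare "1",
# so the pattern looks for '">1</a>, ...' and does not match this line.
(my $copy_raw = $line) =~ s/^$prev//;

# \Q escapes the metacharacters in $prev, so it is matched literally.
(my $copy_quoted = $line) =~ s/^\Q$prev//;

print "raw:    $copy_raw\n";     # line is unchanged
print "quoted: $copy_quoted\n";  # prefix has been removed
```

The first substitution silently does nothing, which is the behaviour reported in the BEFORE/AFTER output.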

Or avoid problems with regexes altogether by using substr:

substr($line,0,length($prev),'');
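
A short sketch of the four-argument substr on the same sample values (the variable $removed is only there for illustration):

```perl
use strict;
use warnings;

my $prev = '">(1)</a>, <a href="';
my $line = '">(1)</a>, <a href="catalog.html#catalog">(2)</a>';

# Four-argument substr replaces the first length($prev) characters with ''
# and returns the text it removed; no pattern matching is involved.
my $removed = substr($line, 0, length($prev), '');

print "removed: $removed\n";
print "left:    $line\n";
```

Note that substr removes the first length($prev) characters without checking that they actually equal $prev. That is safe here only because $prev was just captured from the start of $line.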

sstark · Jun 13, 2009

Thanks Ben and Jim, just the answers I needed.

Scott

sln · Jun 13, 2009

s> while($line =~ /^(.*?href\s*=\s*\")([^\"]+)(\".*)/i){ #"

You don't need to escape these double quotes, and you probably need /is
at the end. There also doesn't seem to be a need for capture #3, or the
.* in #3's group, since it looks like you're buffering with
$line .= <DATA> somewhere. If so, the line will always contain the
remainder after the substitution, and pos() will always be reset. I could
be wrong.

s> Why isn't it deleting the value of $prev in $line?

Somebody already mentioned the parentheses being treated as
metacharacters when $prev is used as the pattern in the substitution.
So that's your problem.

However, it might be less expensive to slurp in the whole file and
avoid the substitution altogether. Still, the substitution is an easy
buffering mechanism, albeit an expensive one in time.
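
On the /is suggestion: once the buffer holds more than one line, "." stops at a newline unless /s is given. A small sketch, assuming a made-up buffer $buf in which the line break falls just before the href:

```perl
use strict;
use warnings;

# Hypothetical buffer: the newline sits between "<a" and "href".
my $buf = qq{<li>item <a\nhref="catalog.html#catalog">(2)</a>\n};

# Without /s, "." cannot cross the newline, so ^.*?href fails.
my $without_s = $buf =~ /^(.*?href\s*=\s*")([^"]+)"/i  ? $2 : undef;

# With /s, "." matches any character, including "\n".
my $with_s    = $buf =~ /^(.*?href\s*=\s*")([^"]+)"/is ? $2 : undef;

print "without /s: ", (defined $without_s ? $without_s : "(no match)"), "\n";
print "with /s:    ", (defined $with_s    ? $with_s    : "(no match)"), "\n";
```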

-sln

-------------
use strict;
use warnings;

# /(?:(.*?href\s*=\s*)(["'])(.*?)(\2))|(.*)/isg

my $line = join ( '', <DATA>);
my $count = 1;

while( $line =~ /
    (?:
      (.*?href\s*=\s*)   # $1 all up to the next and including href=
      (["'])             # $2 " or ' (1+2 = prev)
      (.*?)              # $3 attribute value (unquoted)
      (\2)               # $4 what $2 is (" or ')
    )
  |                      # or
    (.*)                 # $5 the remainder (only hit once)
  /xisg)
{
    print "Pass ".$count++.":\n------------\n";
    if (defined $1) {
        print "prev:\n".$1.$2."\n";
        print "val:\n".$3."\n";
        print "end:\n".$4."\n";
    }
    if (defined $5) {
        print "final:\n".$5."\n";
    }
}
__DATA__
<pre>
<li>description <a href="overview_mh.html#overview">(1)</a>,
<a href="catalog.html#catalog">(2)</a>
</pre>

Charlton Wilbur · Jun 13, 2009

s> Why isn't it deleting the value of $prev in $line?

Because ^ doesn't do what you think it does, and it only works in
the code you have there out of pure luck and coincidence.

Charlton

sln · Jun 14, 2009

Better written as:

use strict;
use warnings;

# /(?:(.*?href\s*=\s*)(["'])(.*?)\2)|(.+)/isg

my $line = join ( '', <DATA>);
my $count = 1;

while( $line =~ /
    (?:
      (.*?href\s*=\s*)   # $1 all up to the next and including href=
      (["'])             # $2 " or ' (1+2 = prev)
      (.*?)              # $3 attribute value (unquoted)
      \2                 # what $2 is (" or ')
    )
  |                      # or
    (.+)                 # $4 the remainder (no more href)
  /xisg)
{
    print "Pass ".$count++.":\n------------\n";
    if (defined $1) {
        print "prev:\n".$1.$2."\n"; # previous plus quote, print to file
        print "val:\n".$3."\n";     # unquoted href value, modify & print to file
        print "end:\n".$2."\n";     # closing quote, print to file
    }
    if (defined $4) {
        print "final:\n".$4."\n";   # remainder, no hrefs, print to file
    }
}

__DATA__
<pre>
<li>description <a href="overview_mh.html#overview">(1)</a>,
<a href="catalog.html#catalog">(2)</a>
</pre>
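
As a sketch of how the loop above might be adapted to the original goal (rewrite each href and emit everything else unchanged), the passes can be appended to an output string instead of printed as diagnostics. The rewrite_href function and its "new/" prefix are hypothetical, since the thread never says how the href is to be changed, and the sample text is held in a here-doc rather than read from a file.

```perl
use strict;
use warnings;

# Hypothetical transformation; stands in for "do some stuff to the href".
sub rewrite_href {
    my ($href) = @_;
    return "new/$href";
}

my $text = <<'END_XHTML';
<li>description <a href="overview_mh.html#overview">(1)</a>,
<a href='catalog.html#catalog'>(2)</a>
END_XHTML

my $out = '';
while ($text =~ /(?:(.*?href\s*=\s*)(["'])(.*?)\2)|(.+)/isg) {
    if (defined $1) {
        # text before the value, opening quote, rewritten value, closing quote
        $out .= $1 . $2 . rewrite_href($3) . $2;
    }
    else {
        $out .= $4;    # trailing text with no further href
    }
}

print $out;
```

Because every character of $text is consumed by one branch or the other, $out is the input with only the attribute values changed.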

Charlton Wilbur · Jun 14, 2009

BM> [ sstark's code was

BM> $line =~ s/^$prev//;

BM> ]

s> Why isn't it deleting the value of $prev in $line?

BM> Please explain further. ^ means 'match at the beginning of the
BM> string', unless /m is given, in which case it means 'match at
BM> the beginning of any line'. How is this not what the OP thought
BM> it meant?

Because the strings he wants are at the start of the strings he's
looking at purely by accident -- it's not part of his specification.
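
For reference, the behaviour of ^ described in the quoted text can be checked with a short sketch ($text is a made-up two-line sample):

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Without /m, ^ matches only at the very start of the string.
my @plain = $text =~ /^(\w+)/g;

# With /m, ^ also matches just after each embedded newline.
my @multi = $text =~ /^(\w+)/mg;

print "without /m: @plain\n";
print "with /m:    @multi\n";
```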

Charlton

sln · Jun 15, 2009

CW> Because the strings he wants are at the start of the strings he's
CW> looking at purely by accident -- it's not part of his specification.

I don't think it's by coincidence.
The caret ^ in this:

/^(.*?href\s*=\s*\")([^\"]+)(\".*)/i

is not actually needed, since .*? will only grab the first instance of the
matching pattern: everything from the beginning of the line up to the first href.
However, he needs /is. So the basic principle is sound, and it doesn't
matter whether the ^ is there or not. There is no global modifier anyway, so
the pos() of the match is not remembered in the while; the search will be
renewed at position 0.
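
A small sketch of both points, assuming a made-up single-line sample in $line: the match starts at offset 0 with or without the ^, and without /g a repeated match finds the same first href again.

```perl
use strict;
use warnings;

my $line = '<li>x <a href="a.html">(1)</a>, <a href="b.html">(2)</a>';

# $-[0] is the offset where the last successful match started.
$line =~ /^(.*?href\s*=\s*")([^"]+)"/is or die "no match";
my $start_anchored = $-[0];

$line =~ /(.*?href\s*=\s*")([^"]+)"/is or die "no match";
my $start_unanchored = $-[0];

# Without /g the match position is not remembered, so matching again
# finds the same first href until $line itself is shortened.
my @seen;
for (1 .. 2) {
    push @seen, $2 if $line =~ /^(.*?href\s*=\s*")([^"]+)"/is;
}

print "start offsets: $start_anchored $start_unanchored\n";
print "seen: @seen\n";
```

This is why the original loop has to delete the matched prefix from $line on every pass: otherwise it would never advance.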

On top of that, the contents of (.*?href\s*=\s*\") are used as a pattern
in a later substitution regex (where he had a metacharacter quoting problem).
The line is still the same until he does the substitution, but he needs to
quote metacharacters when using the capture from the while regex as a pattern
in the substitution regex below it. It finds the exact same thing because the
line didn't change, and it would be the first thing found by the substitution
regex.

The fact that he does this in a while() statement is misleading until you
read it a little more closely.

What he is in fact trying to do is a cheap buffering method that appends file
stream data, line by line, to the buffer ($line) after the substitution, thus
avoiding the global modifier /g and a non-substituting match.

His code doesn't show it, but it would probably have to do something like below
for it to work.

my $line = '';
while (defined (my $buff = <DATA>))
{
    $line .= $buff;
    # original: while($line =~ /^(.*?href\s*=\s*\")([^\"]+)(\".*)/i){ #"
    while($line =~ /^(.*?href\s*=\s*")([^"]+)"/is) # fixed up
    {
        my $prev = $1;
        my $href = $2;
        $prev = quotemeta $prev; # the fix
        $href = quotemeta $href; # the fix
        $line =~ s/^$prev//;
        $line =~ s/^$href//;
    }
}
# see what's left in $line

In reality the ^ shouldn't be needed, but it doesn't seem to hurt.

This method is pretty slow, but it's cheap buffering. The alternative is
something like below.

-sln

-------------
use strict;
use warnings;

# /(?:(.*?href\s*=\s*)(["'])(.*?)\2)|(.+)/isg

my ($line,$buff) = ('','');
my $count = 1;

while (defined ($_ = <DATA>))
{
    $line = $buff.$_;
    $buff = ''; # consumed; refilled below if a remainder is left
    while( $line =~ /(?:(.*?href\s*=\s*)(["'])(.*?)\2)|(.+)/isg )
    {
        if (defined $1) {
            print "Pass ".$count++.":\n------------\n";
            print "prev:\n".$1.$2."\n"; # previous, print to file
            print "val:\n".$3."\n";     # href, modify & print to file
            print "end:\n".$2."\n";     # closing quote, print to file
        }
        elsif (defined $4) {
            $buff = $4; # remainder, buffer it
        }
    }
}
if (length $buff) {
    print "Pass ".$count++.":\n------------\n";
    print "buff:\n".$buff."\n"; # remaining buffer, print to file
}

__DATA__
<pre>
some junk content
<li>description <a href
="overview_mh.html#overview">(1)</a>,
<a href = 'catalog.html#catalog'>(2)</a>
</pre>
