'+' messing up regular expression

Chris Johnson

I've written a CGI script that basically emulates the Apache default
page, but with more customizations. One of these is the addition of
content above the file list, and I've decided to use Wikipedia-esque
shorthand.

I've got it pretty much working. Except there are some problems with
the link conversion. (In case you've never seen it,
[[http://www.google.com|Google]] translates to <a
href="http://www.google.com">Google</a>)

I've found that if there's a '+' in the string to be replaced, it
simply won't be replaced. Here's the code that works on most every
situation:

while(/\[\[(.*?)\]\]/g){
    $new = $1;
    if($new =~ s/^(.*)\|(.*)$/<a href="$1">$2<\/a>/){
        s/\[\[$1\|$2\]\]/$new/g;
    }
}

The specific input that's having trouble is

[[http://fy.chalmers.se/~appro/linux/DVD+RW/|dvd+rw-tools]]

but the peculiar thing is that if I remove the +'s, it makes the
replacement fine (except for the fact that the link is no longer
valid). So does anyone see why this is happening?

Thanks for your time,
Chris
 
A. Sinan Unur

Chris Johnson said:
while(/\[\[(.*?)\]\]/g){
    $new = $1;
    if($new =~ s/^(.*)\|(.*)$/<a href="$1">$2<\/a>/){
        s/\[\[$1\|$2\]\]/$new/g;
    }
}

The specific input that's having trouble is

[[http://fy.chalmers.se/~appro/linux/DVD+RW/|dvd+rw-tools]]

#!/usr/bin/perl

use strict;
use warnings;

my $s = '[[http://fy.chalmers.se/~appro/linux/DVD+RW/|dvd+rw-tools]]';

if($s =~ /^\[\[(.+)\|(.+)\]\]$/) {
    print qq{<a href="$1">$2</a>\n};
}

__END__

D:\Home\asu1\UseNet\clpmisc> c
<a href="http://fy.chalmers.se/~appro/linux/DVD+RW/">dvd+rw-tools</a>
 
Chris Johnson

A. Sinan Unur said:
#!/usr/bin/perl

use strict;
use warnings;

my $s = '[[http://fy.chalmers.se/~appro/linux/DVD+RW/|dvd+rw-tools]]';

if($s =~ /^\[\[(.+)\|(.+)\]\]$/) {
print qq{<a href="$1">$2</a>\n};
}

__END__

D:\Home\asu1\UseNet\clpmisc> c
<a href="http://fy.chalmers.se/~appro/linux/DVD+RW/">dvd+rw-tools</a>

I should clarify, it seems. The input is a text file. I do not simply
want to print the matched patterns; I want to replace the text, and
then print the entire contents of the file. What I'm curious about is
why it won't run the s/$old/$new/g if there's a '+' in $old.

Incidentally, if I change the code to:

while(/\[\[(.*?)\]\]/g){
    $new = $1;
    if($new =~ s/^(.*)\|(.*)$/<a href="$1">$2<\/a>/){
        $old = "[[$1|$2]]";
        s/$old/$new/g;
    }
}

I get the following error:

Invalid [] range "w-t" in regex; marked by <-- HERE in
m/[[http://fy.chalmers.se/~appro/linux/DVD+RW/|dvd+rw-t <-- HERE
ools]]/ at index.cgi line 89.
 
A. Sinan Unur

Chris Johnson said:
I should clarify, it seems. The input is a text file. I do not simply
want to print the matched patterns; I want to replace the text, and
then print the entire contents of the file. What I'm curious about is
why it won't run the s/$old/$new/g if there's a '+' in $old.

Because + and - are special in regexes.

It seems like you need to read the docs.

From perldoc perlop:

\Q quote non-word characters till \E

So, for example:

use strict ;
use warnings;

my $test = 'Sinan+Unur';
my $old = '+';
my $new = ' ';

$test =~ s/$old/$new/g;

print "$test\n";


__END__

D:\Home\asu1\UseNet\clpmisc> c
Quantifier follows nothing in regex; marked by <-- HERE in m/+ <-- HERE
/ at D:\Home\asu1\UseNet\clpmisc\c.pl line 8.

Whereas:

use strict ;
use warnings;

my $test = 'Sinan+Unur';
my $old = '+';
my $new = ' ';

$test =~ s/\Q$old\E/$new/g;

print "$test\n";

__END__

D:\Home\asu1\UseNet\clpmisc> c
Sinan Unur
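
Applied to the loop from the original post, the same idea might look like the sketch below. The sample string assigned to $_ and the names $url and $text are assumptions made for illustration; the only real change is wrapping the interpolated captures in \Q...\E.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample text standing in for the file contents (assumed for this sketch).
$_ = 'See [[http://fy.chalmers.se/~appro/linux/DVD+RW/|dvd+rw-tools]] here.';

while (/\[\[(.*?)\]\]/g) {
    my $new = $1;
    if ($new =~ s/^(.*)\|(.*)$/<a href="$1">$2<\/a>/) {
        # Copy the captures before another match can overwrite them.
        my ($url, $text) = ($1, $2);
        # \Q...\E makes '+', '-', '.', '/' and friends match literally.
        s/\[\[\Q$url\E\|\Q$text\E\]\]/$new/g;
    }
}

print "$_\n";
```

With the captures quoted, the second substitution treats the URL and the link text as plain strings rather than as regex source.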
 
Chris Johnson

Thank you. I was under the impression that those characters only made a
difference if they were typed explicitly, but not if they were part of
a variable.
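
For anyone else with the same impression, here is a small sketch of what interpolation does. The variable names and the sample string are made up for the example; quotemeta is the function form of \Q...\E.

```perl
use strict;
use warnings;

my $text   = 'dvd+rw-tools';
my $raw    = 'dvd+rw';          # interpolated as regex source: 'd+' is a quantifier
my $quoted = quotemeta $raw;    # every non-word character gets a backslash

print "quoted pattern: $quoted\n";                                     # dvd\+rw
print "raw matches:    ", ($text =~ /$raw/    ? 'yes' : 'no'), "\n";   # no
print "quoted matches: ", ($text =~ /$quoted/ ? 'yes' : 'no'), "\n";   # yes
```

The unquoted pattern compiles without complaint and simply fails to match, which is why the original substitution did nothing instead of raising an error.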
 
Jürgen Exner

Chris Johnson wrote:
[...]
then print the entire contents of the file. What I'm curious about is
why it won't run the s/$old/$new/g if there's a '+' in $old.

Well, it does, but you probably didn't mean for the '+' sign to indicate
one or more instances of the preceding unit in the RE.
For example, /a+/ matches any non-empty sequence of the letter 'a'.

Chris Johnson wrote:
Incidentally, if I change the code to:

I get the following error:

Invalid [] range "w-t" in regex;

Well, yeah, how many characters are there between 'w' and 't'? Note: I
didn't ask for characters between 't' and 'w'.

I strongly recommend you familiarize yourself with regular expressions.
"perldoc perlretut" is a reasonably good introduction.
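
To see the range problem in isolation, a minimal sketch along these lines may help. The patterns are kept in variables (the names are arbitrary) so that they compile at run time and eval can trap the failure.

```perl
use strict;
use warnings;

my %report;
for my $class ('[t-w]', '[w-t]') {
    # Compiling from a variable happens at run time, so eval can catch the error.
    my $re = eval { qr/$class/ };
    $report{$class} = defined $re
        ? join(' ', grep { /$re/ } 'a' .. 'z')   # letters the class accepts
        : "error: $@";                           # compile failure message
    print "$class => $report{$class}\n";
}
```

The ascending range [t-w] accepts t, u, v and w, while the descending [w-t] is rejected with the same "Invalid [] range" message quoted above.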

jue
 
Tad McClellan

A. Sinan Unur said:
Because + and - are special in regexes.


Hyphen (-) is not meta in a regular expression, while plus (+) is meta.

Hyphen (-) is meta in a character class, while plus (+) is not meta.


We must peel our "language onion" to know what funny characters are funny.

We have a language inside of a language inside of a language. The
teeny-tiny character class language is inside of the larger regular
expression language which is inside of big ol' Perl.

So we must identify which language we are currently in before we
know what metacharacters apply.

eg:

Hyphen (-):

Perl: subtraction
RE: not meta
CC: range

Caret (^):

Perl: bitwise exclusive or
RE: beginning of string
CC: negates the class
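
A rough sketch of the three layers in action, for anyone who wants to try them out. The labels and test strings are invented for the example; each check is expected to come out true.

```perl
use strict;
use warnings;

# Each entry pairs a label with a check that is run in scalar context.
my @checks = (
    [ q{Perl: 7 - 2 subtracts, giving 5},           sub { 7 - 2 == 5 } ],
    [ q{RE:   /^a-c$/ matches the literal 'a-c'},   sub { 'a-c' =~ /^a-c$/ } ],
    [ q{CC:   /^[a-c]$/ is a range, matches 'b'},   sub { 'b' =~ /^[a-c]$/ } ],
    [ q{RE:   /^a+$/ is a quantifier, matches aaa}, sub { 'aaa' =~ /^a+$/ } ],
    [ q{CC:   /^[+]$/ matches a literal '+'},       sub { '+' =~ /^[+]$/ } ],
    [ q{Perl: 6 ^ 3 is bitwise xor, giving 5},      sub { (6 ^ 3) == 5 } ],
    [ q{RE:   /^a/ anchors at the start of 'abc'},  sub { 'abc' =~ /^a/ } ],
    [ q{CC:   /^[^a]$/ negates, matches 'b'},       sub { 'b' =~ /^[^a]$/ } ],
);

my @results;
for my $check (@checks) {
    my ($label, $code) = @$check;
    my $ok = $code->() ? 1 : 0;
    push @results, [ $label, $ok ];
    printf "%-46s %s\n", $label, $ok ? 'true' : 'false';
}
```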
 
T Beck

Chris Johnson wrote:
[snip early description]
while(/\[\[(.*?)\]\]/g){
$new = $1;
if($new =~ s/^(.*)\|(.*)$/<a href="$1">$2<\/a>/){
s/\[\[$1\|$2\]\]/$new/g;
}
}

The specific input that's having trouble is

[[http://fy.chalmers.se/~appro/linux/DVD+RW/|dvd+rw-tools]]

but the peculiar thing is that if I remove the +'s, it makes the
replacement fine (except for the fact that the link is no longer
valid). So does anyone see why this is happening?

Everyone's pointed out why it's happening... here's some code to get
around it. The trick is to do the whole conversion in one substitution
instead of feeding what you captured into a second one (Sinan alluded
to this with his first post, but this might be a more usable version
for you).

#!/usr/bin/perl
use strict;
use warnings;

my $input =
q{[[http://fy.chalmers.se/~appro/linux/DVD+RW/|dvd+rw-tools]]
other text
[[http://www.google.com|google]] Final text};

$input =~ s/\[\[(.*?)\|(.*?)\]\]/<a href="$1">$2<\/a>/sg;

print "Output:\n$input\n";

../test.pl
Output:
<a href="http://fy.chalmers.se/~appro/linux/DVD+RW/">dvd+rw-tools</a>
other text
<a href="http://www.google.com">google</a> Final text


--T Beck
 
A. Sinan Unur

Tad McClellan said:
Hyphen (-) is not meta in a regular expression, while plus (+) is
meta.

Hyphen (-) is meta in a character class, while plus (+) is not meta.


We must peel our "language onion" to know what funny characters are
funny.

Absolutely. Thank you for the clarification.

Sinan
 