User-defined substitution

Peter J. Holzer

I want to provide non-trusted users with a way to make substitutions
similar to s///. To my surprise this isn't an FAQ (although there is a
similar problem - "How can I expand variables in text strings?"), but
maybe I shouldn't be surprised because this is the first time I needed
to do that myself ;-).


So here is what I came up with - please try to shoot it down:


sub replace {
    my ($string, $pattern, $replacement) = @_;

    if (my @m = $string =~ m/$pattern/p) {
        for my $i (1 .. @m) {
            $replacement =~ s/\$$i(?=\D|)/$m[$i-1]/e;
        }
        return ${^PREMATCH} . $replacement . ${^POSTMATCH};
    } else {
        return $string;
    }
}

All three strings are potentially untrusted ($string comes from a
database query, $pattern and $replacement are supplied by the user).

The function should do the same as
$string =~ s/$pattern/$replacement/;
except that $<number> references to capture buffers are resolved, too.

I think this is safe:

* The match itself should be safe unless "use re 'eval'" is active
  (it isn't).
* The s///e only evaluates to $m[$i-1], it doesn't evaluate the
  content, so it cannot be used to inject code.
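
To illustrate the first bullet, here is a minimal sketch, assuming a made-up hostile pattern, of how Perl rejects a (?{ ... }) code block that arrives through an interpolated variable when "use re 'eval'" is not in effect. The variable names are only for illustration.

```perl
use strict;
use warnings;

# Hypothetical hostile pattern: tries to run code through a (?{ ... }) block.
my $pattern = '(?{ print "pwned\n" })x';

# Without "use re 'eval'", interpolating such a pattern dies at run time,
# so the eval block returns undef and $@ holds the error message.
my $matched = eval { 'x' =~ /$pattern/ };
my $error   = $@;

print defined $matched ? "pattern ran\n" : "rejected: $error";
```

The eval is only there to show the error; it does not make the pattern any safer by itself.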

There is one catch, though: If $pattern doesn't contain any capturing
parentheses, @m is set to (1) on a successful match, which isn't
distinguishable from a pattern which captures one string "1". I guess I
could try to analyze $pattern or just document that at least one set of
parentheses must be used.

Did I miss something? Is there a simpler way?

hp
 
sln

I don't know what you are trying to resolve re:
"except that $<number> references to capture buffers are resolved, too."
Can you elaborate?

my ($Str,$Pat,$Rep) = (
"and \$this end",
"(\\\$this)",
"\$1\$1\$1"
);
print "\nStr: '$Str'\nPat: '$Pat'\nRepl:'$Rep'\n\n";
my $res = replace ($Str,$Pat,$Rep);
print "Str = '$res'\n";
---
Out:
Str: 'and $this end'
Pat: '(\$this)'
Repl:'$1$1$1'
Str = 'and $this$1$1 end'
---
Which I think is incorrect. It remains to be seen whether
the above $Str,$Pat,$Rep can actually get in this form via
console though.

I don't know about how safe this is. Croaking on code could be as bad as
injecting code.
Also, only very simple s&r can be done and with no modifiers.
Though the /p modifier and ${^PREMATCH} . $replacement . ${^POSTMATCH}
are a good start.

It would be easier if it were just for programmed usage; then it's
at your own risk. It's hard to imagine this could be safely made robust
for command line usage. I could be wrong.

-sln
 
Peter J. Holzer

> Quoth "Peter J. Holzer":
>> I want to provide non-trusted users with a way to make substitutions
>> similar to s///. To my surprise this isn't an FAQ (although there is a
>> similar problem - "How can I expand variables in text strings?"), but
>> maybe I shouldn't be surprised because this is the first time I needed
>> to do that myself ;-).
>>
>> So here is what I came up with - please try to shoot it down:
>>
>>
>> sub replace {
>>     my ($string, $pattern, $replacement) = @_;
>>
>>     if (my @m = $string =~ m/$pattern/p) {
>>         for my $i (1 .. @m) {
>>             $replacement =~ s/\$$i(?=\D|)/$m[$i-1]/e;
>>         }
>>         return ${^PREMATCH} . $replacement . ${^POSTMATCH};
>
> I don't think this behaviour is documented. That is, AIUI
> ${^{PRE,POST}MATCH} should be considered undefined (in the C sense, not
> the Perl sense) after a match which didn't specify /p.

I think you are right. I'm not sure if this is intentional but the
wording in perlvar allows a match to clobber ${^{PRE,POST}MATCH} even if
no /p is present.

> That's easily fixed.
> $#+ gives the number of sets of capturing parens in the last successful
> match.

Hmpf, I should have thought of that. Thanks.
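
The $#+ check can be seen in a small sketch. It assumes two throwaway test strings and shows that the list returned by the match looks identical in both cases, while $#+ tells them apart.

```perl
use strict;
use warnings;

# No capturing parentheses: a successful list-context match returns (1).
my @m = 'foo' =~ /o+/;
my $groups_none = $#+;    # number of capture groups in the last match: 0

# One capturing group that happens to capture the string "1".
my @n = 'x1y' =~ /(\d)/;
my $groups_one = $#+;     # 1

print "@m / $groups_none\n";
print "@n / $groups_one\n";
```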

hp
 
Peter J. Holzer

> I don't know what you are trying to resolve re:
> "except that $<number> references to capture buffers are resolved, too."

Consider:
    $string = 'foo';
    $pattern = '(f)(o)';
    $replacement = '$2$1';
    $string =~ s/$pattern/$replacement/;
versus
    $string =~ s/$pattern/$2$1/;

In the first case the result is '$2$1o', in the second it is 'ofo'. I
want the second behaviour, i.e., I want to replace $<number> patterns
with the contents of the corresponding capture buffers.
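
The difference between the two cases can be run side by side. This is just the example above wrapped into a script; the names $from_variable and $from_literal are made up for the comparison.

```perl
use strict;
use warnings;

my $pattern     = '(f)(o)';
my $replacement = '$2$1';

# Replacement taken from a variable: its content is inserted literally.
(my $from_variable = 'foo') =~ s/$pattern/$replacement/;

# Replacement written in the source: $2 and $1 are interpolated.
(my $from_literal = 'foo') =~ s/$pattern/$2$1/;

print "$from_variable\n";   # $2$1o
print "$from_literal\n";    # ofo
```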

> Can you elaborate?
>
> my ($Str,$Pat,$Rep) = (
>     "and \$this end",
>     "(\\\$this)",
>     "\$1\$1\$1"
> );
> print "\nStr: '$Str'\nPat: '$Pat'\nRepl:'$Rep'\n\n";
> my $res = replace ($Str,$Pat,$Rep);
> print "Str = '$res'\n";
> ---
> Out:
> Str: 'and $this end'
> Pat: '(\$this)'
> Repl:'$1$1$1'
> Str = 'and $this$1$1 end'

Right. There is a /g modifier missing.

But I noticed a more serious error. Evaluating the $<number> sequences
in numerical order allows double evaluation, e.g.,
replace('aaa$2bbb', '(.*(...))', '$1')
returns 'aaabbbbbb' instead of 'aaa$2bbb'.
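
The double evaluation can be reproduced without the full function. The sketch below assumes the captures are already in @m (the values '(.*(...))' yields for 'aaa$2bbb') and compares one pass per capture number against a single pass over the replacement.

```perl
use strict;
use warnings;

# Captures of '(.*(...))' matched against 'aaa$2bbb'.
my @m = ('aaa$2bbb', 'bbb');

# One pass per capture number: text inserted for $1 is scanned again for $2.
my $sequential = '$1';
for my $i (1 .. @m) {
    $sequential =~ s/\$$i(?!\d)/$m[$i-1]/eg;
}

# Single left-to-right pass: inserted text is never scanned again.
(my $single_pass = '$1') =~ s/\$(\d+)/$m[$1-1]/eg;

print "$sequential\n";    # aaabbbbbb
print "$single_pass\n";   # aaa$2bbb
```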

The substitutions need to be performed in a strictly left to right
manner. Adding Ben's suggestions and support for \ escapes at the same
time I get:

sub replace2 {
    my ($string, $pattern, $replacement) = @_;

    if (my @m = $string =~ m/$pattern/p) {
        my ($prematch, $postmatch) = (${^PREMATCH}, ${^POSTMATCH});
        if ($#+) {
            my $new = "";
            for (split(/(\$\d+|\\.)/, $replacement)) {
                if (/^\$(\d+)/) {
                    $new .= $m[$1-1];
                } elsif (/\\(.)/) {
                    $new .= $1;
                } else {
                    $new .= $_;
                }
            }
            $replacement = $new;
        }
        return $prematch . $replacement . $postmatch;
    } else {
        return $string;
    }
}

So it gets more complicated, not simpler as I hoped :-/.

> It remains to be seen whether the above $Str,$Pat,$Rep can actually
> get in this form via console though.

I see no reason to disallow them.

> I don't know about how safe this is. Croaking on code could be as bad as
> injecting code.

It shouldn't croak either.

> Also, only very simple s&r can be done and with no modifiers.

The only modifier I might need is /g - I could add that as an option.

Case conversion might also be useful, so maybe I'll add support for \U,
\L, etc. later. That should fit well into the inner if/elsif cascade.

> It's hard to imagine this could be safely made robust
> for command line usage.

I'm sure it can be made safe and I think it already is.
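
As a rough idea of how the case-conversion support (\U, \L and friends) might slot into the if/elsif cascade, here is a minimal sketch. The helper name expand_replacement is hypothetical, the captures are passed in directly instead of coming from a match, and it assumes \E ends a conversion the way it does in Perl's double-quoted strings.

```perl
use strict;
use warnings;

# Hypothetical helper: expand $N, \U, \L, \E and \<char> in one
# left-to-right pass over the replacement string.
sub expand_replacement {
    my ($replacement, @captures) = @_;
    my ($new, $case) = ('', '');
    for my $token (split /(\$\d+|\\.)/s, $replacement) {
        my $text;
        if ($token =~ /^\$(\d+)$/) {
            $text = $captures[$1 - 1];       # capture reference
        } elsif ($token =~ /^\\([ULE])$/) {
            $case = $1 eq 'E' ? '' : $1;     # switch case mode, emit nothing
            next;
        } elsif ($token =~ /^\\(.)$/s) {
            $text = $1;                      # escaped character
        } else {
            $text = $token;                  # plain text
        }
        $new .= $case eq 'U' ? uc($text)
              : $case eq 'L' ? lc($text)
              :                $text;
    }
    return $new;
}

print expand_replacement('\U$1\E-$2 costs \$5', 'foo', 'Bar'), "\n";
```

With the captures 'foo' and 'Bar' this prints "FOO-Bar costs $5".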

hp
 
Uri Guttman

PJH> Consider:
PJH> $string = 'foo';
PJH> $pattern = '(f)(o)';
PJH> $replacement = '$2$1';
PJH> $string =~ s/$pattern/$replacement/;
PJH> versus
PJH> $string =~ s/$pattern/$2$1/;

PJH> In the first case the result is '$2$1o', in the second it is 'ofo'. I
PJH> want the second behaviour, i.e., I want to replace $<number> patterns
PJH> with the contents of the corresponding capture buffers.

you can't do that directly with $replacement. a /e would probably do
it. why not parse out the $1 yourself and replace it with a substr using
indexes @- and @+? not too hard to do if you don't care about checking
for escaped $'s. even that isn't too much work.

PJH> sub replace2 {
PJH> my ($string, $pattern, $replacement) = @_;

PJH> if (my @m = $string =~ m/$pattern/p) {
PJH> my ($prematch, $postmatch) = (${^PREMATCH}, ${^POSTMATCH});
PJH> if ($#+) {
PJH> my $new = "";
PJH> for (split(/(\$\d+|\\.)/, $replacement)) {
PJH> if (/^\$(\d+)/) {
PJH> $new .= $m[$1-1];
PJH> } elsif (/\\(.)/) {
PJH> $new .= $1;
PJH> } else {
PJH> $new .= $_;
PJH> }
PJH> }
PJH> $replacement = $new;
PJH> }
PJH> return $prematch . $replacement . $postmatch;
PJH> } else {
PJH> return $string;
PJH> }

that all seems way too complicated to me. here is my above idea in VERY
rough logic:

do the match as you do now.

copy @- and @+ so you don't clobber them in the s///
parse the replacement for $1 and do an s/// on that with /e


$replacement =~ s/\$(\d+)/substr( $orig_str, $beg[$1], $end[$1] - $beg[$1] )/eg ;

that's the idea. it needs cleaning up and testing. it really is a simple
templater when you look at it. so this brings up another idea. why not
create your own simpler syntax for the grabs and parse them out yourself
and do something like the above?

PJH> I'm sure it can be made safe and I think it already is.

my idea is safe as it never executes any user code, just does matching
and replacements later.
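
A minimal sketch of the @-/@+ idea, cleaned up enough to run. The helper name replace_with_offsets is made up, references to groups that did not take part in the match are assumed to expand to the empty string, and this is not necessarily what uri had in mind in detail.

```perl
use strict;
use warnings;

# Hypothetical helper illustrating the @-/@+ approach.
sub replace_with_offsets {
    my ($string, $pattern, $replacement) = @_;
    return $string unless $string =~ /$pattern/;

    # Copy the offsets now: the s///e below runs its own match
    # and resets @- and @+.
    my @beg = @-;
    my @end = @+;

    # Single pass over the replacement; no user text is ever evaluated.
    $replacement =~ s{\$(\d+)}{
        defined $beg[$1] ? substr($string, $beg[$1], $end[$1] - $beg[$1]) : ''
    }eg;

    return substr($string, 0, $beg[0]) . $replacement . substr($string, $end[0]);
}

print replace_with_offsets('foo', '(f)(o)', '$2$1'), "\n";   # ofo
```

Note that $0 expands to the whole match here, since $beg[0] and $end[0] delimit it.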

uri
 
sln

>> I don't know what you are trying to resolve re:
>> "except that $<number> references to capture buffers are resolved, too."
>
> Consider:
>     $string = 'foo';
>     $pattern = '(f)(o)';
>     $replacement = '$2$1';
>     $string =~ s/$pattern/$replacement/;
> versus
>     $string =~ s/$pattern/$2$1/;
>
> In the first case the result is '$2$1o', in the second it is 'ofo'. I
> want the second behaviour, i.e., I want to replace $<number> patterns
> with the contents of the corresponding capture buffers.
>
> But I noticed a more serious error.
>
> The substitutions need to be performed in a strictly left to right
> manner. Adding Ben's suggestions and support for \ escapes at the same
> time I get:
>
> sub replace2 {
>     my ($string, $pattern, $replacement) = @_;
>
>     if (my @m = $string =~ m/$pattern/p) {
>         my ($prematch, $postmatch) = (${^PREMATCH}, ${^POSTMATCH});
>         if ($#+) {
>             my $new = "";
>             for (split(/(\$\d+|\\.)/, $replacement)) {
>                 if (/^\$(\d+)/) {
>                     $new .= $m[$1-1];

$new .= $m[$1-1] // '';
Else that uninitialized value warning might show up.
// '' is an option, or // $_

>                 } elsif (/\\(.)/) {
>                     $new .= $1;
>                 } else {
>                     $new .= $_;
>                 }
>             }
>             $replacement = $new;
>         }
>         return $prematch . $replacement . $postmatch;
>     } else {
>         return $string;
>     }
> }
>
> So it gets more complicated, not simpler as I hoped :-/.

Yes, but this works fairly well.

> I see no reason to disallow them.

I guess what I was saying was that you still have to deal with
interpolation from the console, even if it's just single-quote \\
sequences.
Pattern and replacement \\ can get ugly in strings, especially the
even/odd situation.

It's not like:
    $repl = '\\\$1\$1';
    $str =~ s/c/$repl/;
vs
    $str =~ s/c/\\\$1\$1/;

> Case conversion might also be useful, so maybe I'll add support for \U,
> \L, etc. later. That should fit well into the inner if/elsif cascade.
>
> I'm sure it can be made safe and I think it already is.
>
> hp

I really meant the case that $Pattern is actually running
in a regex and may run afoul when the expression is parsed -
"Unmatched ( in regex; marked by <-- HERE in m/( <-- HERE"
Somehow this could be trapped? Don't know.

There are probably other gotchas in this process.
But it looks like you've taken this farther than I'd seen anybody
else take it. It would be nice to see native Perl support for this.
Replacement has always been an issue in the past.

-sln
 
Peter J. Holzer

PJH> Consider:
PJH> $string = 'foo';
PJH> $pattern = '(f)(o)';
PJH> $replacement = '$2$1';
PJH> $string =~ s/$pattern/$replacement/;
PJH> versus
PJH> $string =~ s/$pattern/$2$1/;

PJH> In the first case the result is '$2$1o', in the second it is 'ofo'. I
PJH> want the second behaviour, i.e., I want to replace $<number> patterns
PJH> with the contents of the corresponding capture buffers.

> you can't do that directly with $replacement.

Yes, that's why I simulate it with explicit substitutions. I just
explained the *effect* I want to achieve.

> a /e would probably do it.

Not directly.

> why not parse out the $1 yourself

Actually, that's what I do.

> and replace it with a substr using
> indexes @- and @+? not too hard to do if you don't care about checking
> for escaped $'s. even that isn't too much work.

PJH> sub replace2 {
PJH> my ($string, $pattern, $replacement) = @_;

PJH> if (my @m = $string =~ m/$pattern/p) {
PJH> my ($prematch, $postmatch) = (${^PREMATCH}, ${^POSTMATCH});
PJH> if ($#+) {
PJH> my $new = "";
PJH> for (split(/(\$\d+|\\.)/, $replacement)) {
PJH> if (/^\$(\d+)/) {
PJH> $new .= $m[$1-1];
PJH> } elsif (/\\(.)/) {
PJH> $new .= $1;
PJH> } else {
PJH> $new .= $_;
PJH> }
PJH> }
PJH> $replacement = $new;
PJH> }
PJH> return $prematch . $replacement . $postmatch;
PJH> } else {
PJH> return $string;
PJH> }

> that all seems way too complicated to me. here is my above idea in VERY
> rough logic:
>
> do the match as you do now.
>
> copy @- and @+ so you don't clobber them in the s///
> parse the replacement for $1 and do an s/// on that with /e
>
>
> $replacement =~ s/\$(\d+)/substr( $orig_str, $beg[$1], $end[$1] - $beg[$1] )/eg ;

That can be simplified to

$replacement =~ s/\$(\d+)/$m[$1-1]/eg ;

(except that it behaves differently for $0)

So that's a better solution to my original problem. But the addition of
\-escapes makes it a bit more complicated. I could use:

$replacement =~ s/\$(\d+)|\\(.)/defined $1 ? $m[$1-1] : $2/eg;

but that's not much simpler than the explicit loop above and it's
not obvious how to extend that to handling \U and friends.
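
For reference, a small runnable check of the combined one-liner with an escaped dollar sign. It is only a sketch: @m holds made-up captures instead of the result of a real match, and the literal $ in the pattern is written as \$.

```perl
use strict;
use warnings;

my @m = ('foo', 'bar');                  # hypothetical captures
my $replacement = '\$1 is $1 and $2';    # \$1 should stay a literal "$1"

# One pass: either a $<number> reference or a backslash-escaped character.
(my $expanded = $replacement)
    =~ s/\$(\d+)|\\(.)/defined($1) ? $m[$1-1] : $2/eg;

print "$expanded\n";   # $1 is foo and bar
```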

> that's the idea. it needs cleaning up and testing. it really is a simple
> templater when you look at it. so this brings up another idea. why not
> create your own simpler syntax for the grabs and parse them out yourself
> and do something like the above?

I try to avoid inventing my own syntax for functionality that already
exists. There are already way too many slightly incompatible search and
replace variants in the world. Unless I can come up with something which
is a lot better I'd like to stick to something which at least some of my
users are already familiar with.

> PJH> I'm sure it can be made safe and I think it already is.
>
> my idea is safe as it never executes any user code, just does matching
> and replacements later.

Yup.

hp
 
Peter J. Holzer

>> if (my @m = $string =~ m/$pattern/p) { [...]
>> $new .= $m[$1-1];
>
> $new .= $m[$1-1] // '';
> Else that uninitialized value warning might show up.
> // '' is an option, or // $_

I think you should be warned if you try to reference an undefined
capture buffer. That's a feature, not a bug.

> I guess what I was saying was that you still have to deal with
> interpolation from the console, even if it's just single-quote \\
> sequences.

I don't know what console you are using, but mine doesn't interpolate
any printable characters[1].

> Pattern and replacement \\ can get ugly in strings, especially the
> even/odd situation.
>
> It's not like:
>     $repl = '\\\$1\$1';
>     $str =~ s/c/$repl/;
> vs
>     $str =~ s/c/\\\$1\$1/;

This is a property of the Perl syntax. It doesn't have anything to do
with the console.

> I really meant the case that $Pattern is actually running
> in a regex and may run afoul when the expression is parsed -
> "Unmatched ( in regex; marked by <-- HERE in m/( <-- HERE"
> Somehow this could be trapped? Don't know.

That's not what I meant by "safe". In this case the user will get an
error message and needs to fix the problem. The program would be
"unsafe" if it allowed the user to do something which he isn't meant to
do (access data with somebody else's privileges, run arbitrary code, etc.).
I probably should have used the word "secure" instead of "safe".
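
On the question of trapping a malformed pattern: one common way is to compile it inside an eval block and report the error instead of dying. A minimal sketch, assuming a hypothetical helper compile_pattern that returns the compiled regex or the error text:

```perl
use strict;
use warnings;

# Hypothetical helper: compile an untrusted pattern and return
# (regex, undef) on success or (undef, error message) on failure.
sub compile_pattern {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/ };
    return defined $re ? ($re, undef) : (undef, $@);
}

my ($re, $err) = compile_pattern('(');
print defined $re ? "ok\n" : "bad pattern: $err";
```

The user still gets the error message, but the program decides how to present it.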

hp

[1] Actually, on HP-UX, serial consoles and telnet sessions start with
erase and kill set to '#' and '@' respectively. Which is highly
annoying, especially if you have used one of these characters in
your password, so don't do that ;-).
 
Martijn Lievaart

> [1] Actually, on HP-UX, serial consoles and telnet sessions start with
> erase and kill set to '#' and '@' respectively. Which is highly
> annoying, especially if you have used one of these characters in
> your password, so don't do that ;-).

Aha, one other mystery solved, thanks!

M4
 
sln

[snip]
>> Pattern and replacement \\ can get ugly in strings, especially the
>> even/odd situation.
>>
>> It's not like:
>>     $repl = '\\\$1\$1';
>>     $str =~ s/c/$repl/;
>> vs
>>     $str =~ s/c/\\\$1\$1/;
>
> This is a property of the Perl syntax. It doesn't have anything to do
> with the console.

Yes, a property of Perl parsing. Passed as arguments, I guess you only
have to deal with delimiters specific to that CLI, so \\\$1\$1 as an argument
will be Perl's internal string representation if no chars are CLI delimiters.
The Windows cmd processor has a weird mix of space and double-quote delimiters.
It's always possible to get the values interactively after the program
starts. So yeah, the console is just a device.

Well done, it looks good.

-sln
 
