Help with dynamic regex


D

Dave Saville

I have a general purpose CGI script to interogate various log files on
my server. Because sometimes I want to limit what I see I have an
optional field where I can type a regular expression.

If the field is blank then obviously nothing special needs to be done,
but if there is something then I need to build a regex. If the field
is just for example abc then I need m/abc/ generated - or something
that will do that. OTH if the field has /abc/i then I need m/abc/i and
so on.

Things are further complicated in that if there is a regular
expression then I want that matching text highlighted in the output.
What I have follows - but it does not always work. For example if I
have /o/g it does not highlight all the o's.

sub getlog
{
my $logfile = shift;
open LOG, "<$logfile" or die "Can't open $logfile $!";

local $regex = $regex;

if ( $regex )
{
if ( $regex =~ m/^(.)(.*)\1([giomsx]+)$/ )
{
my $body = $2;
my $mod = $3;

if ( $mod )
{
$regex = qr/(?$mod)$body/;
}
else
{
$regex = qr/$body/;
}
}
}

while ( <LOG> )
{
s/\</&lt;/gm;
s/\>/&gt;/gm;
s/\xf8/&deg;/gm;

if ( $regex )
{
push @log, $_ if s/($regex)/\<span
class="searchlight"\>$1\<\/span\>/;
}
else
{
push @log, $_;
}
}

close LOG;
}

TIA
 
Ad

Advertisements

R

Rainer Weikusat

Dave Saville said:
I have a general purpose CGI script to interogate various log files on
my server. Because sometimes I want to limit what I see I have an
optional field where I can type a regular expression.

If the field is blank then obviously nothing special needs to be done,
but if there is something then I need to build a regex. If the field
is just for example abc then I need m/abc/ generated - or something
that will do that. OTH if the field has /abc/i then I need m/abc/i and
so on.

Things are further complicated in that if there is a regular
expression then I want that matching text highlighted in the output.
What I have follows - but it does not always work. For example if I
have /o/g it does not highlight all the o's.

sub getlog
{
my $logfile = shift;
open LOG, "<$logfile" or die "Can't open $logfile $!";

local $regex = $regex;

A general remark: Consider using my for local variables of
subroutines. It is generally faster and these variables are then
private to the subroutine which defines them while the local binding
will be visible to all subroutines invoked from this one. Also, using
a parameter to pass data into subroutines is usually sensible because
the subroutine then becomes independent of its environment and it is
easier to see what data flows into it (ie, without reading through all
of the code). Here, this would mean using something like

sub getlog
{
my ($logfile, $regex) = @_;
if ( $regex )
{
if ( $regex =~ m/^(.)(.*)\1([giomsx]+)$/ )
{
my $body = $2;
my $mod = $3;

if ( $mod )
{
$regex = qr/(?$mod)$body/;
}
else
{
$regex = qr/$body/;
}
}
}

while ( <LOG> )
{
s/\</&lt;/gm;
s/\>/&gt;/gm;
s/\xf8/&deg;/gm;

if ( $regex )
{
push @log, $_ if s/($regex)/\<span
class="searchlight"\>$1\<\/span\>/;
}

According to perlre(1) qr/// doesn't support the g modifier and if you
want 'global substituations' (in a particular line of input), you need
to apply that to s///. A possible way to achieve that would be to
create a subroutine which performs the desired modification on the input
line and invoke that for each line. Example how this could be done:

------------
my $r = '|g|g'; # set to regex
my $s;

if ($r =~ /(.)(.*)\1([giomsx]*)$/) {
$s = eval(sprintf('sub { $_[0] =~ s/(%s)/XX\1XX/%s; }',
$2, $3));
} else {
$s = sub { };
}

$a = 'gagbgcgd';
$s->($a);

print $a, "\n";
 
R

Rainer Weikusat

[...]
if ($r =~ /(.)(.*)\1([giomsx]*)$/) {
$s = eval(sprintf('sub { $_[0] =~ s/(%s)/XX\1XX/%s; }',
$2, $3));

Additional remark: Because this is based on interpolating a string
into some code, it will be possible to use this to execute arbitrary
code by providing suitable input to the script, eg,

my $r = '|a)//; print "Gotcha!\n"; s/(b|';

will print Gotcha! whenever the substituation routine is
executed. OTOH, since regexes support executing code anyway, this
might not be much of a concern.
 
D

Dave Saville

On Wed, 7 Mar 2012 17:37:58 UTC, Rainer Weikusat

A general remark: Consider using my for local variables of
subroutines. It is generally faster and these variables are then
private to the subroutine which defines them while the local binding
will be visible to all subroutines invoked from this one. Also, using
a parameter to pass data into subroutines is usually sensible because
the subroutine then becomes independent of its environment and it is
easier to see what data flows into it (ie, without reading through all
of the code). Here, this would mean using something like

Noted.

According to perlre(1) qr/// doesn't support the g modifier and if you

That explains a lot :)
want 'global substituations' (in a particular line of input), you need
to apply that to s///. A possible way to achieve that would be to
create a subroutine which performs the desired modification on the input
line and invoke that for each line. Example how this could be done:

------------
my $r = '|g|g'; # set to regex
my $s;

if ($r =~ /(.)(.*)\1([giomsx]*)$/) {
$s = eval(sprintf('sub { $_[0] =~ s/(%s)/XX\1XX/%s; }',
$2, $3));
} else {
$s = sub { };
}

$a = 'gagbgcgd';
$s->($a);

print $a, "\n";
------------

Sorry you have lost me there. Apart from it not doing what I need if
the regex is just supplied as "abc" I do not understand that eval -
obviously it runs the sprintf but what is that sub all about or the $s
= sub{} ?

I am guessing that $s gets set to *something* that $s->($a) passes $a
to - but how does that work?

Thanks for the help.
 
D

Dave Saville

[...]
if ($r =~ /(.)(.*)\1([giomsx]*)$/) {
$s = eval(sprintf('sub { $_[0] =~ s/(%s)/XX\1XX/%s; }',
$2, $3));

Additional remark: Because this is based on interpolating a string
into some code, it will be possible to use this to execute arbitrary
code by providing suitable input to the script, eg,

my $r = '|a)//; print "Gotcha!\n"; s/(b|';

will print Gotcha! whenever the substituation routine is
executed. OTOH, since regexes support executing code anyway, this
might not be much of a concern.

No concern at all in this case as it's just me. :)
 
R

Rainer Weikusat

Dave Saville said:
]

------------
my $r = '|g|g'; # set to regex
my $s;

if ($r =~ /(.)(.*)\1([giomsx]*)$/) {
$s = eval(sprintf('sub { $_[0] =~ s/(%s)/XX\1XX/%s; }',
$2, $3));
} else {
$s = sub { };
}

$a = 'gagbgcgd';
$s->($a);

print $a, "\n";
------------

Sorry you have lost me there. Apart from it not doing what I need if
the regex is just supplied as "abc"

It expects that all regexes start and end with a delimiter-character.
It might make more sense to use that in the function template, eg

sprintf('sub { $_[0] =~ s%s(%s)%s/XX\1XX%s%s; }',
$1, $2, $1, $1, $3);

since there has to be some kind of delimiter and it must not occur
unescaped in the 're text' itself (so the burden of picking something
suitable is on the user). Otherwise, some kind of escaping would need
to be added.
I do not understand that eval -
obviously it runs the sprintf but what is that sub all about or the $s
= sub{} ?

sub { } just created an anonymous subroutine which does nothing and
returns a coderef pointing to that. The 'eval' compiles and runs the
code contained in the string passed to it and returns the result. In
this case, that's a references to an anonymous subroutine which does
the intended pattern substituation (subject to the limitations
mentioned elswhere).
 
Ad

Advertisements

D

Dave Saville

Dave Saville said:
]

------------
my $r = '|g|g'; # set to regex
my $s;

if ($r =~ /(.)(.*)\1([giomsx]*)$/) {
$s = eval(sprintf('sub { $_[0] =~ s/(%s)/XX\1XX/%s; }',
$2, $3));
} else {
$s = sub { };
}

$a = 'gagbgcgd';
$s->($a);

print $a, "\n";
------------

Sorry you have lost me there. Apart from it not doing what I need if
the regex is just supplied as "abc"

It expects that all regexes start and end with a delimiter-character.
It might make more sense to use that in the function template, eg

sprintf('sub { $_[0] =~ s%s(%s)%s/XX\1XX%s%s; }',
$1, $2, $1, $1, $3);

since there has to be some kind of delimiter and it must not occur
unescaped in the 're text' itself (so the burden of picking something
suitable is on the user). Otherwise, some kind of escaping would need
to be added.
I do not understand that eval -
obviously it runs the sprintf but what is that sub all about or the $s
= sub{} ?

sub { } just created an anonymous subroutine which does nothing and
returns a coderef pointing to that. The 'eval' compiles and runs the
code contained in the string passed to it and returns the result. In
this case, that's a references to an anonymous subroutine which does
the intended pattern substituation (subject to the limitations
mentioned elswhere).

Thanks again - my brain hurts. :)
 
R

Rainer Weikusat

Ben Morrow said:
Quoth "Dave Saville said:
if ($r =~ /(.)(.*)\1([giomsx]*)$/) {
$s = eval(sprintf('sub { $_[0] =~ s/(%s)/XX\1XX/%s; }',
$2, $3));

Additional remark: Because this is based on interpolating a string
into some code,

...it is a completely stupid thing to do in a CGI script.

Depending on the purpose of the script, ie, is it only going to be
used by people who control the machine, anyway, as was the case here,
the input valid needs to be sanitized in some suitable way in order to
prevent code execution. Since this might not be self-evident to
everyone (aka 'SQL injection?'), I wrote the additional posting
pointing this out. Consequently, you assertion that I wrote this
because I would be 'too stupid' to understand the consequence, while
presumably being your honest conviction, does not hold water.

[...]
Regexes are protected against interpolated sections containing (?{}) and
the like. See the documentation for 'use re "eval"'.

This is not applicable here, cf

when "use re 'eval'" is in effect, a regex is allowed to
contain "(?{ ... })" zero-width assertions even if regular
expression contains variable

interpolation. That is normally disallowed, since it is a
potential security risk. Note that this pragma is ignored when
the regular expression is obtained from tainted data, i.e.
evaluation is always disallowed with tainted regular
expressions. See "(?{ code })"
in perlre.


For the purpose of this pragma, interpolation of precompiled
regular expressions (i.e., the result of "qr//") is not
considered variable interpolation.
That is the *wrong* attitude.

Nope. That's the appropriate attitude: Security isn't about 'voodoo
dancing for the sake of it' but about preventing some kind of
privilege escalation from happening. And 'requiring privilege' in
order to be able to perform the operation at all is a completely valid
strategy to achieve that: It is conceptually identical to having a
'privileged' system account of some kind.
 
R

Rainer Weikusat

Ben Morrow said:
Quoth Rainer Weikusat said:
my $r = '/g/g';
my ($s, $mod);

if ($r) {
if ($r =~ /(.)(.*)\1([giomsx]*)$/) {
$r = $2;
$mod = $3;
}

$s = eval(sprintf('sub { $_[0] =~ s/(%s)/XX\1XX/%s; }',
qr/$r/, $mod));

Clever. Of course, eval is lexically scoped within the surrounding
block, so there's no need to compile the regex twice:

my $qr = qr/$r/;
$s = eval sprintf 'sub { $_[0] =~ s/$qr/XX\1XX/%s; }', $mod;

This is a regex with a variable dynamically interpolated into it, IOW,
the $qr is evaluated every time the sub is executed, as opposed to
once when it is compiled. Since this is not good for anything, it
should rather be avoided.
I might even consider that safe enough to use, given that the
interpolated section is now entirely predictable, except that there's no
need for eval at all. The only modifers you are accounting for that
can't be put in a (?X) are /g and /o, and /o isn't useful anyway, so
there's really nothing wrong with

my $g = $mod =~ s/g//g;
if ($g) {
$a =~ s/(?$mod)$r/XX$1XX/g;
}
else {
$a =~ s/(?$mod)$r/XX$1XX/;
}

and it's a whole lot harder to get wrong.

It is more code (consequently, it is easier to get wrong) and it
requires a conditional test whose outcome is always either one way or
the other way for each line of input processed. And log files can
easily be huge. Not to mention that it is also prone to so-called
'update anomalies' since there are now two replacement expressions
which need to be kept in sync manually.

[...]
Regexes are protected against interpolated sections containing (?{}) and
the like. See the documentation for 'use re "eval"'.

This is not applicable here, cf
[...]

For the purpose of this pragma, interpolation of precompiled
regular expressions (i.e., the result of "qr//") is not
considered variable interpolation.

The input to qr// is considered variable interpolation, though,
meaning, ordinarily, (?{...}) won't work.

...which was my point. Interpolating into a regex is safe (modulo DoS
attacks using everlasting patterns) in a way eval can never be.

There's no problem with eval per se, just with the string which is
actually evaluated. Consequently, there is some (possibly absent)
input validation/ sanitization code and this code can either be
correct, then, evaluating the input is 'safe;', or incorrect and then, it
isn't provided 'evaluating user input' isn't intentionally done in
order to enable users to execute whatever Perl code might be useful
for them. I've done this at least once in the past, with a script
which was not intended to be a security boundary, where users of the
script could use an arbitrary Perl expression in order to provide 'a
size argument'. The idea was that it might be convenient to use a Perl
expression calculating the actual size instead of having to do that in
some separate way before invoking the script.
 
R

Rainer Weikusat

Ben Morrow said:
Quoth Rainer Weikusat said:
Ben Morrow said:
my $qr = qr/$r/;
$s = eval sprintf 'sub { $_[0] =~ s/$qr/XX\1XX/%s; }', $mod;

This is a regex with a variable dynamically interpolated into it, IOW,
the $qr is evaluated every time the sub is executed, as opposed to
once when it is compiled. Since this is not good for anything, it
should rather be avoided.

No it isn't. Run it under re "debug" to see. Since the LHS of the s///
is nothing but a qr//, perl doesn't recompile the regex, it just uses
the compiled version in the qr//. (That's the whole point of qr//.)

In fact, perl is cleverer than that. Even if you were to write

my $r = "f";
$s = eval 'sub { /XX $r/ }';

perl would still only compile the pattern once. It tracks the variables
used by the pattern, and only recompiles it if they've changed. (This is
why /o is not useful any more.)

The most-recent perl I'm using is 5.10.1 and it still documents this
as

PATTERN may contain variables, which will be interpolated (and
the pattern recompiled) every time the pattern search is
evaluated, except for when the delimiter is a single quote.
(Note that $(, $), and $| are not interpolated because they
look like end-of-string tests.) If you want such a pattern to
be compiled only once, add a "/o" after the trailing
delimiter.

Even assuming the additional checks you mentioned above (since when to
these exist), that still means execution-time overhead which can
easily be avoided.
Your code,

$s = eval(sprintf('sub { $_[0] =~ s/(%s)/XX\1XX/%s; }',
qr/$r/, $mod));

OTOH, compiles the pattern twice. First it is compiled into a qr//, then
that qr// is stringified and interpolated into a larger string, then
that larger string is evaled. The eval will have to compile the pattern
again, since it's lost any connection to the first qr//.

Well, yes. But since this is executed only once while the substitution
routine is possibly called millions (if not billions) of times, that
seems like a worthwhile tradeoff to me.

[...]

Yes, for those of us who are perfect programmers (and who can be certain
everyone maintaining the code after them will be too), eval can be
terribly convenient. For those of us who live in the real world, it
isn't safe.

"It isn't safe" means nothing except "Be afraid of it! You really
ought to!". Nothing which can be misused is safe from being misused
and 'in the real world' people who are anything but 'perfect
programmers' have created a whole new class of exploits by
interpolating userdata into SQL statement templates without proper
quoting (so-called 'SQL injection'). And the solution to this problem
is not 'trying to scare people away from using this technique' (that's
not going to work for the people who created the problem, anyway,
because their main objective is usually "get shit done fast") but to
teach them how to avoid interpolation into strings when it isn't
needed (eg, when a parametrized query can be used instead) and how to
do it safely when it is needed.

Granted, it would be nicer when Perl had some structured
representation of its own code which could be used for runtime code
generation but that gets us deeply into 'oatmeal with finger nail
clippings' territory :) (and IMHO, the lisp guys went off the deep end
with CLOS ...).
A CGI script (or anything else which talks to the network) should always
be considered a security boundary. That's part of my point.

And this point is wrong: Eg, a CGI script which is supposed to be used
by authenticated users to make configuration changes to some system
they control is not a security boundary.
 
D

Dave Saville

Quoth "Dave Saville said:
if ($r =~ /(.)(.*)\1([giomsx]*)$/) {
$s = eval(sprintf('sub { $_[0] =~ s/(%s)/XX\1XX/%s; }',
$2, $3));

Additional remark: Because this is based on interpolating a string
into some code,

...it is a completely stupid thing to do in a CGI script. This is,
unforunately, a case which would have slipped by taint mode, but it's
still ridiculously dangerous.

So given the original problem - Search remote log files for some
arbitary regex - how would you go about it? SSH to the box and grep is
not an acceptable solution. Much easier in a browser where the log
files are predefined ( one does not have to remember *where* they are
- not all in /var/log) and additionally you have the browsers search
and highlight to play with.
Regexes are protected against interpolated sections containing (?{}) and
the like. See the documentation for 'use re "eval"'.


That is the *wrong* attitude. If it's a CGI script you *must* assume Bad
People will get their hands on it, no matter how 'internal' you think
the webserver serving it is. You should be using taint mode and writing
with security in mind.

Given that it is my system and that particular script is on a non
standard port that is shut on the firewall I think that is very
internal :) My server *always* runs taint mode *and* I check the
logs.

As a general point, if you need to provide a search to end users,
forgeting my requirement to type a regex, how would you sanitise the
provided search? ie just a simple string search but that would be
implemented by a regex.
 
Ad

Advertisements

J

John W. Krahn

Ben said:
Quoth Rainer Weikusat said:
my $r = '/g/g';
my ($s, $mod);

if ($r) {
if ($r =~ /(.)(.*)\1([giomsx]*)$/) {
$r = $2;
$mod = $3;
}

$s = eval(sprintf('sub { $_[0] =~ s/(%s)/XX\1XX/%s; }',
qr/$r/, $mod));

Clever. Of course, eval is lexically scoped within the surrounding
block, so there's no need to compile the regex twice:

my $qr = qr/$r/;
$s = eval sprintf 'sub { $_[0] =~ s/$qr/XX\1XX/%s; }', $mod;

You are still compiling the regexp twice. Also there is no need to use
sprintf and the use of \1 in the replacement string should be $1:

$s = eval "sub { \$_[0] =~ s/$r/XX\${1}XX/$mod; }";



John
 
R

Rainer Weikusat

Dave Saville said:
[...]

As a general point, if you need to provide a search to end users,
forgeting my requirement to type a regex, how would you sanitise the
provided search? ie just a simple string search but that would be
implemented by a regex.

Use qr//, cf

This operator quotes (and possibly compiles) its STRING
as a regular expression. STRING is interpolated the
same way as PATTERN in "m/PATTERN/". If "'" is used as
the delimiter, no interpolation is done.

The qr actually means 'quote regex' and judging from the 'and possibly
compiles' quoted above, it was originally meant to be used for
quoting.
 
R

Rainer Weikusat

John W. Krahn said:
Ben Morrow wrote:
[...]
$s = eval sprintf 'sub { $_[0] =~ s/$qr/XX\1XX/%s; }', $mod;

You are still compiling the regexp twice. Also there is no need to
use sprintf and the use of \1 in the replacement string should be $1:

$s = eval "sub { \$_[0] =~ s/$r/XX\${1}XX/$mod; }";

The \1 on the RHS came from the originally posted code. According to
perlre(1),

This is grandfathered for the RHS of a substitute to avoid
shocking the sed addicts, but it's a dirty habit to get into.

So, if somebody wants to use it in his own code, why not? The sprintf
came from me. I usually prefer that to doing variable interpolations
because it clearly separates the template string and the argument
interpolated into it.
 
Ad

Advertisements

D

Dave Saville

Thanks Ben and Rainer for your help and explanations. Working to my
satisfaction now :)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top