Regular Expression Generator

jeremyje

Is there a library or a way to generate an appropriate regular
expression for any given input string?
(remove quotes for examples)
For example: "1234567890abcdef is in hex9"
Regex Generator returns: [0-9|A-F]{16} [a-z]{2} [a-z]{2} [0-9|a-z]{3}

Or anything that does some sort of similar processing?
 
Josef Moellers

Hardly.
First of all, your example is incorrect: "[0-9|A-F]{16}" will not match
"1...abcdef".
Second, the following RE will also match:
"1234567890abcdef is in hex9" as will
"[0-9a-z]{16} [0-9a-z]{2} [0-9a-z]{2} [0-9a-z]{3}" as will
".{16} .{2} .{2} .{4}" as will
".*\s.*\s.*\s.*" as will
"\S+\s+\S+\s+\S+\s+\S+"

IOW, there is no single "appropriate regular expression" but infinitely
many (or some number so close to infinity that choosing among them
automatically is impractical).
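
The point above can be checked directly. The following is a minimal sketch, not part of the original post: it runs the sample string against the OP's pattern and against each alternative listed above, unanchored, as the post implies. The OP's pattern fails because the class [0-9|A-F] contains only digits, a literal "|" and the upper-case letters A-F, while the sample uses lower-case hex digits. The variable names are chosen for illustration.

```perl
use strict;
use warnings;

my $sample = '1234567890abcdef is in hex9';

# The OP's pattern first, then the alternatives from the reply above.
my @candidates = (
    '[0-9|A-F]{16} [a-z]{2} [a-z]{2} [0-9|a-z]{3}',
    '1234567890abcdef is in hex9',
    '[0-9a-z]{16} [0-9a-z]{2} [0-9a-z]{2} [0-9a-z]{3}',
    '.{16} .{2} .{2} .{4}',
    '.*\s.*\s.*\s.*',
    '\S+\s+\S+\s+\S+\s+\S+',
);

# Record, for each pattern, whether it matches the sample (unanchored).
my %matched;
for my $pattern (@candidates) {
    $matched{$pattern} = $sample =~ /$pattern/ ? 1 : 0;
    printf "%-52s %s\n", $pattern,
        $matched{$pattern} ? 'matches' : 'does not match';
}
```

Only the first pattern fails; every other one matches, which is the whole difficulty: a single sample does not say which of them was intended.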
 
Xicheng Jia

Jürgen Exner

Well, yes, sure: actually the desired RE is a constant: .*
For a more advanced RE you can even quantify it with the length of the
string.

Seriously: it is impossible to derive a generic RE pattern from a single
text sample.

And you provided a case in point: why are you scanning for [a-f] in the
first part (I assume the upper case is a mistake, otherwise the RE wouldn't
match anyway) but for a-z in the second part? Shouldn't that be [is] or
maybe /is/? Without knowing the generic pattern it is impossible to know
what RE you may be looking for.

Jue
 
Dr.Ruud

I once created a Visual Basic function that derived a mask from the
lines of a file. All the lines were supposed to have the same length,
and all characters were printable, which made it a lot easier.

It would return a string of the same length. Special character values
were used for character sets, like 0x01 for [A-Z], 0x02 for [a-z], 0x03
for [A-Za-z], 0x04 for [0-9], 0x05 for [0-9A-Z], 0x07 for [0-9A-Za-z],
etc. It even recognized EBCDIC-numericals. It could also show a '@' for
alpha and a '#' for numeric.

A graphical character like ',' would mean that all lines in the file had
a ',' in that position. All in all it was very handy to get a quick idea
of what a fixed record file was about.
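
The idea described above can be sketched in Perl. This is not Dr.Ruud's Visual Basic function, only a simplified illustration under stated assumptions: all lines have the same length, only the '@' (alpha) and '#' (numeric) markers are implemented, and '?' is an invented marker for any column that fits neither class. The name derive_mask is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: derive a per-column mask from equal-length lines.
sub derive_mask {
    my @lines = @_;
    my $width = length $lines[0];
    my $mask  = '';
    for my $col (0 .. $width - 1) {
        # Collect the distinct characters seen in this column.
        my %seen;
        $seen{ substr($_, $col, 1) } = 1 for @lines;
        my @chars = keys %seen;

        if (@chars == 1) {
            $mask .= $chars[0];              # constant column: show the character
        }
        elsif (!grep { !/[0-9]/ } @chars) {
            $mask .= '#';                    # digits only
        }
        elsif (!grep { !/[A-Za-z]/ } @chars) {
            $mask .= '@';                    # letters only
        }
        else {
            $mask .= '?';                    # mixed (marker is an assumption)
        }
    }
    return $mask;
}

my $mask = derive_mask('AB12,x', 'CD34,y', 'EF56,z');
print "$mask\n";    # prints @@##,@
```

A constant column is shown as the character itself, so a literal '@', '#' or '?' in the data would be ambiguous in this simplified form; the original avoided that by using non-printable values such as 0x01 for the character sets.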
 
Ted Zlatanov

On 26 Jun 2006, (e-mail address removed) wrote:

Seriously: it is impossible to derive a generic RE pattern from a single
text sample.

I think this is incorrect, Jurgen. The OP was asking about an
appropriate, not a generic regex. Other than
http://search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Regexp/Optimizer.pm
(which I mentioned in c.l.p.modules to answer his post, before I saw
his cross-post here), you can always just say

my $regex = '^(' . join('|', @strings) . ')$';

and that's a regex that will match any given non-empty strings.

Ted
 
Dr.Ruud

Ted Zlatanov wrote:
my $regex = '^(' . join('|', @strings) . ')$';

and that's a regex that will match any given non-empty strings.

'^(?:' . join( '|', map quotemeta, grep /./, @strings ) . ')$'
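
A short sketch of why the quotemeta and grep in the corrected expression matter. The sample strings are made up for illustration; the two expressions are the ones quoted above.

```perl
use strict;
use warnings;

# Made-up inputs: two contain regex metacharacters, one is empty.
my @strings = ('a.c', '', 'x+y');

# The original suggestion: strings are interpolated as regex syntax.
my $naive = '^(' . join('|', @strings) . ')$';

# The corrected version: drop empty strings, escape metacharacters.
my $safe = '^(?:' . join('|', map quotemeta, grep /./, @strings) . ')$';

for my $input ('a.c', 'abc', 'x+y', '') {
    printf "%-5s naive: %-3s safe: %s\n", "'$input'",
        ($input =~ /$naive/ ? 'yes' : 'no'),
        ($input =~ /$safe/  ? 'yes' : 'no');
}
```

With the naive pattern, 'abc' matches (the "." is a wildcard), the empty string matches (the empty alternative), and the literal 'x+y' does not match (the "+" is a quantifier). The escaped pattern accepts exactly 'a.c' and 'x+y'.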
 
Ala Qumsieh

Dr.Ruud said:
Ted Zlatanov schreef:


'^(?:' . join( '|', map quotemeta, grep /./, @strings ) . ')$'

This solution has a caveat. Regexps have a maximum length (65539 bytes I
believe). If you have enough strings in @strings (or if they are long
enough), then the compiled regexp can exceed this length, and error out. I
encountered this once, and the solution I resorted to was to construct an
anonymous sub on the fly:

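# Build the source text of an anonymous sub, one test per string.
# \Q...\E is expanded while the heredoc is interpolated.
my $string = <<EOS;
sub {
    local \$_ = shift;
    return 1 if /\Q$strings[0]\E/;
    return 1 if /\Q$strings[1]\E/;
    ....
}
EOS

my $matches = eval $string;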

Then use this anon sub to match:

if ($matches->($myString)) { ... }

--Ala
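
The elided lines in the snippet above would normally be produced by a loop. The following is a minimal sketch of that, not Ala's actual code: it assumes the strings contain no newlines, uses quotemeta() where the heredoc used \Q...\E, and adds an explicit "return 0" and an error check on eval. The sample strings are invented.

```perl
use strict;
use warnings;

# Invented sample strings, including regex and delimiter characters.
my @strings = ('foo.bar', 'a/b', 'x+y');

# Build the source text: one "return 1 if /.../;" line per string.
my $source = "sub {\n    local \$_ = shift;\n";
$source .= '    return 1 if /' . quotemeta($_) . "/;\n" for @strings;
$source .= "    return 0;\n}\n";

# Compile the generated source into a code reference.
my $matches = eval $source or die "could not compile generated sub: $@";

print $matches->('say foo.bar twice') ? "match\n" : "no match\n";   # match
print $matches->('fooXbar')           ? "match\n" : "no match\n";   # no match
```

As in the original snippet, the tests are unanchored, so the sub returns true when any of the strings occurs anywhere in the argument.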
 
Dr.Ruud

Ala Qumsieh wrote:
Dr.Ruud:

This solution has a caveat. Regexps have a maximum length (65539
bytes I believe). If you have enough strings in @strings (or if they
are long enough), then the compiled regexp can exceed this length,
and error out. I encountered this once, and the solution I resorted
to was to construct an anonymous sub on the fly:

If so, it would have the same problem, because any of the strings can be
too long.

perl -Mwarnings -le '
$n = 1_000_000 ;
$_ = ".." x $n ;
$r = qr/^\Q$_\E$/ ;
print length($r), ":", /$r/ ;
'

prints 4000011:1
 
Jürgen Exner

Ted said:
On 26 Jun 2006, (e-mail address removed) wrote:




I think this is incorrect, Jurgen. The OP was asking about an
appropriate, not a generic regex. Other than
http://search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Regexp/Optimizer.pm
(which I mentioned in c.l.p.modules to answer his post, before I saw
his cross-post here), you can always just say

my $regex = '^(' . join('|', @strings) . ')$';

and that's a regex that will match any given non-empty strings.


True. As will /.+/. And the other extreme is /\Q$string\E/.

Chances are the OP was looking for neither of those 'solutions' but for
something in between.
But where the right 'in between' lies is something you cannot
decide based on a single sample.

jue
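
To make the "in between" concrete, here is a sketch of one arbitrary generalization rule of the kind the OP seems to have had in mind: classify each character as digit, lower-case letter, upper-case letter or literal, and collapse runs into counted classes. The rule and the names char_class and guess_regex are assumptions made for illustration; nothing in the thread says this is the right rule, which is exactly the point being made.

```perl
use strict;
use warnings;

# Hypothetical helper: map one character to a character class.
sub char_class {
    my ($ch) = @_;
    return '[0-9]' if $ch =~ /[0-9]/;
    return '[a-z]' if $ch =~ /[a-z]/;
    return '[A-Z]' if $ch =~ /[A-Z]/;
    return quotemeta $ch;    # anything else is kept as a literal
}

# Hypothetical helper: run-length encode the classes into one pattern.
sub guess_regex {
    my ($sample) = @_;
    my @runs;    # each entry is [class, count]
    for my $ch (split //, $sample) {
        my $class = char_class($ch);
        if (@runs && $runs[-1][0] eq $class) {
            $runs[-1][1]++;
        }
        else {
            push @runs, [$class, 1];
        }
    }
    return join '',
        map { $_->[1] > 1 ? $_->[0] . '{' . $_->[1] . '}' : $_->[0] } @runs;
}

my $sample = '1234567890abcdef is in hex9';
my $guess  = guess_regex($sample);
print "$guess\n";
print "sample matches\n" if $sample =~ /^$guess$/;
```

The result matches the sample, but it also accepts '1234567890abcdeg is in hex9', which is not hex, and rejects '12345678abcdef90 is in hex9', which is. Whether either behaviour is wanted cannot be read off a single sample.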
 
Ted Zlatanov

This solution has a caveat. Regexps have a maximum length (65539 bytes I
believe). If you have enough strings in @strings (or if they are long
enough), then the compiled regexp can exceed this length, and error out. I
encountered this once, and the solution I resorted to was to construct an
anonymous sub on the fly:

You and Dr. Ruud make great points. My original code was written in
haste, sorry about that. If I had done it with some brainwaves active, it
would have been:

# untested
my %hash;
$hash{$_} = 1 foreach @strings;
sub matches { return exists $hash{shift()};}

No need for subroutines and eval(). Then you can use matches() in the
regex as a code escape :) Isn't Perl great?

Ted
 
