Regular Expression finder

Joe Smith

Hi,

does anyone know of a tool that would be able to extract the regular
expression that corresponds to a set of Strings?

For instance:

This tool, given
"abc", "aec", "akkc"
would return a regular expression like "a.+c"

Is this possible? Is it done?

Thanks!
 
David Hilsee

Joe Smith said:
Hi,

does anyone know of a tool that would be able to extract the regular
expression that corresponds to a set of Strings?

For instance:

This tool, given
"abc", "aec", "akkc"
would return a regular expression like "a.+c"

Is this possible? Is it done?

_The_ regular expression? There are an infinite number of regular
expressions that match those strings. Even if there were a tool that could
guess at a regex using heuristics, you'd still need to examine its output to
ensure that its result meets your needs.

Personally, I'd prefer using something that can quickly test the regexes
that your brain comes up with. The Komodo IDE had such a feature that I
found quite helpful. I haven't seen anything like it in other IDEs, though.
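For readers without such an IDE feature, a few lines of Java can do the same job. This is only a minimal sketch: the class name RegexTester, the method check, and the extra sample strings "ac" and "xyz" are made up for illustration and do not come from any tool mentioned in this thread.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class RegexTester {
    /** Reports, for each sample, whether the whole string matches the candidate regex. */
    public static Map<String, Boolean> check(String regex, List<String> samples) {
        Pattern p = Pattern.compile(regex);
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (String s : samples) {
            results.put(s, p.matcher(s).matches());
        }
        return results;
    }

    public static void main(String[] args) {
        // The first three samples are from the original post; "ac" and "xyz" are
        // hypothetical counterexamples added to tell the candidates apart.
        List<String> samples = Arrays.asList("abc", "aec", "akkc", "ac", "xyz");
        for (String candidate : new String[] {"a.+c", "a.*", "a[a-z]+c"}) {
            System.out.println(candidate + " -> " + check(candidate, samples));
        }
    }
}
```

All three candidates accept the three original strings, but they disagree on the extra samples, which is exactly why the output still has to be inspected by hand.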
 
Michael Borgwardt

Joe said:
does anyone know of a tool that would be able to extract the regular
expression that corresponds to a set of Strings?

There is no "the" there.

Joe said:
For instance:

This tool, given
"abc", "aec", "akkc"
would return a regular expression like "a.+c"

Why not "a[bek].*" or "a.*"?

Joe said:
Is this possible? Is it done?

It's certainly possible (and very easy) to write a method to
return a regular expression that matches any of a given set of
Strings:

public String getRegexp(String[] strings){
return ".*";
}

Or did you mean a regexp that matches all of the given Strings
and *only* those? The example you give fails in that regard, but
it's also quite easy to do:

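public String getRegexp(String[] strings){
    // Join the strings with '|' inside one group: (s1|s2|...)
    StringBuffer result = new StringBuffer("(");
    for(int i=0; i<strings.length; i++){
        result.append(strings[i]+"|");
    }
    // Overwrite the trailing '|' with the closing parenthesis
    // (assumes at least one string was given)
    result.setCharAt(result.length()-1, ')');
    return result.toString();
}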

(you'd have to add escape sequences for characters that have
meaning in regexps)
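One way to handle the escaping is Pattern.quote from java.util.regex, which wraps a string so that it is matched literally. The following is a sketch of that variant, not the code posted above; the class name LiteralAlternation and the choice to reject an empty array are assumptions made for the example.

```java
import java.util.regex.Pattern;

public class LiteralAlternation {
    /** Builds a regex that matches exactly the given strings and nothing else. */
    public static String getRegexp(String[] strings) {
        if (strings.length == 0) {
            // Assumption: an empty input is treated as a caller error.
            throw new IllegalArgumentException("need at least one string");
        }
        StringBuilder result = new StringBuilder("(?:");
        for (int i = 0; i < strings.length; i++) {
            if (i > 0) {
                result.append('|');
            }
            // Pattern.quote makes metacharacters such as '.' or '|' literal.
            result.append(Pattern.quote(strings[i]));
        }
        return result.append(')').toString();
    }
}
```

With this version, an input such as "a.c" matches only the three characters a, dot, c, and not "abc".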

The real question is: which of the *infinite* number of regular expressions
that match a given set of Strings do you want to find?
 
Joe Smith

Michael Borgwardt said:
Why not "a[bek].*" or "a.*"?

The real question is: which of the *infinite* number of regular expressions
that match a given set of Strings do you want to find?

Ok, ok... it's clear that my idea needs more explanation:

It's true that there's an infinite number of regexps that may match a set of
Strings... So perhaps, what I really want is to extract the common sections
of these strings... And replace the other parts with the "minimum" regexp...
And yes, there will be countless of them!!...
Idea:

"header body1 body2 footer epilogue"

"Prolog header body1 footer"

I would have something like: "(Prolog)? header body1 (body2)? footer
(epilogue)?"

For instance, "diff" is able to find the differences between two files...
The tool I'm thinking of would perform diffs on several inputs, to be able
to extract these common parts...

But well, I guess it's too "abstract" for a program.

Thanks anyway!!
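
For two inputs, the diff idea described above can be sketched with a longest-common-subsequence pass over whitespace-separated tokens: tokens present in both strings become required, all others become optional groups. This is only an illustration under several assumptions: the class name CommonPartsRegex and its methods are invented here, tokens are split on whitespace, spacing between tokens is loosened to \s*, and extending it to more than two inputs is left open.

```java
import java.util.regex.Pattern;

public class CommonPartsRegex {
    /** Builds a regex from two strings: shared tokens are required, the rest optional. */
    public static String fromPair(String first, String second) {
        String[] a = first.trim().split("\\s+");
        String[] b = second.trim().split("\\s+");

        // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..]
        int[][] lcs = new int[a.length + 1][b.length + 1];
        for (int i = a.length - 1; i >= 0; i--) {
            for (int j = b.length - 1; j >= 0; j--) {
                lcs[i][j] = a[i].equals(b[j])
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both token lists, following the table to keep common tokens aligned.
        StringBuilder regex = new StringBuilder();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i].equals(b[j])) {
                regex.append(required(a[i]));
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                regex.append(optional(a[i++]));
            } else {
                regex.append(optional(b[j++]));
            }
        }
        while (i < a.length) regex.append(optional(a[i++]));
        while (j < b.length) regex.append(optional(b[j++]));
        return regex.toString();
    }

    // Simplification: any amount of whitespace (including none) may follow a token.
    private static String required(String token) {
        return Pattern.quote(token) + "\\s*";
    }

    private static String optional(String token) {
        return "(?:" + Pattern.quote(token) + "\\s*)?";
    }
}
```

Calling fromPair on the two example strings above yields a pattern in which Prolog, body2 and epilogue are optional while header, body1 and footer are required. It accepts both inputs, but as with any generated regex, the result still needs to be checked against what is actually wanted.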
 
Matt Humphrey

Joe Smith said:
For instance, "diff" is able to find the differences between two files...
The tool I'm thinking of would perform diffs on several inputs, to be able
to extract these common parts...

But well, I guess it's too "abstract" for a program.

This is a research area, particularly in user interfaces. You may find
something useful in section 4.4 of this paper:
http://www.ics.uci.edu/~dhilbert/papers/EDEM-UCI-ICS-98-13.pdf

Cheers,
Matt Humphrey (e-mail address removed) http://www.iviz.com/
 
sks

David Hilsee said:
_The_ regular expression? There are an infinite number of regular
expressions that match those strings. Even if there were a tool that could
guess at a regex using heuristics, you'd still need to examine its output to
ensure that its result meets your needs.

Personally, I'd prefer using something that can quickly test the regexes
that your brain comes up with. The Komodo IDE had such a feature that I
found quite helpful. I haven't seen anything like it in other IDEs,
though.

There's a plug-in for Eclipse; you'd have to search for it on Google, though.
 
Carl Howells

Michael said:
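public String getRegexp(String[] strings){
    // Join the strings with '|' inside one group: (s1|s2|...)
    StringBuffer result = new StringBuffer("(");
    for(int i=0; i<strings.length; i++){
        result.append(strings[i]+"|");
    }
    // Overwrite the trailing '|' with the closing parenthesis
    // (assumes at least one string was given)
    result.setCharAt(result.length()-1, ')');
    return result.toString();
}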

(you'd have to add escape sequences for characters that have
meaning in regexps)


Last I checked, the Java regex engine handles that badly... It uses
recursion to build the automaton used for matching, and it recurses too
deeply on an alternation with a few thousand options, throwing an exception.
 
Michael Borgwardt

Carl said:
Last I checked, the java regex engine is pretty bad for that... It uses
recursion to build the automaton used for matching, which recurses too
deeply on an alternation with a few thousand options, throwing an
exception.

It wasn't really meant as a serious suggestion. Processing that kind of
pattern would be a waste of resources with *any* regexp engine.
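
If the goal really is "these exact strings and only those", a plain set lookup avoids the regex engine altogether. The sketch below assumes exact, case-sensitive equality is what is wanted; the class name ExactMatcher is invented for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class ExactMatcher {
    private final Set<String> allowed;

    public ExactMatcher(String[] strings) {
        // Copy the inputs into a hash set so each lookup is a single hash probe.
        this.allowed = new HashSet<>(Arrays.asList(strings));
    }

    /** True only if the input equals one of the original strings. */
    public boolean matches(String input) {
        return allowed.contains(input);
    }
}
```

No pattern is compiled here, so the number of strings does not affect any recursion depth.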
 
