topher67 said:
Let's assume the following:
* curlies are the only specials allowed
* the substrings inside the curlies can be of differing lengths
Ah, that makes it harder than I had hoped...
* there may be more than one expanded set in the input list
Do you mean like "abc{d,e,f}ghi{j,k,l}mn", where you have a Cartesian join,
or do you mean like in your example below, where there is more than one
line of patterns but any given line has at most one set of curlies?
* we won't handle nested curlies (e.g. Foo{A{1,2,3}Z,XY}Bar )
Nesting actually probably wouldn't be so bad to implement, at least
compared to Cartesian joins. In fact, the example you give below is just a
special kind of nesting, equivalent to {Foo{ZZZ,Y,XX}Bar,Baz{11,222},Nop}.
A special kind because you can only have two levels, and the outer level
cannot have any fixed characters before or after it, but it is still
nested.
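
To make the equivalence concrete, here is a rough Python sketch of the expansion direction (pattern to list), which is the easy half of the problem. The function name expand is my own, and it assumes curlies and commas are the only special characters, with no escaping. It handles both nesting and Cartesian joins by recursing on each alternative and on whatever follows the closing curly.

```python
def expand(pattern):
    """Expand a curly-brace pattern into the list of strings it denotes."""
    start = pattern.find("{")
    if start == -1:
        return [pattern]          # no curlies left: the pattern is literal

    # Scan from the first '{' to its matching '}', splitting on the commas
    # that sit at nesting depth 1 (commas deeper down belong to inner sets).
    depth = 0
    alternatives = []
    piece_start = start + 1
    for i in range(start, len(pattern)):
        ch = pattern[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                alternatives.append(pattern[piece_start:i])
                end = i
                break
        elif ch == "," and depth == 1:
            alternatives.append(pattern[piece_start:i])
            piece_start = i + 1
    else:
        raise ValueError("unbalanced curlies in %r" % pattern)

    prefix, suffix = pattern[:start], pattern[end + 1:]
    results = []
    for alt in alternatives:
        for head in expand(alt):          # nested curlies inside this alternative
            for tail in expand(suffix):   # later sets give the Cartesian join
                results.append(prefix + head + tail)
    return results


print(expand("{Foo{ZZZ,Y,XX}Bar,Baz{11,222},Nop}"))
```

Run on the nested pattern above, this gives back the six strings of the example list, in the same order. The hard part is going the other way.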
Here's another example:
FooZZZBar
FooYBar
FooXXBar
Baz11
Baz222
Nop
Becomes:
Foo{ZZZ,Y,XX}Bar
Baz{11,222}
Nop
I realize that this is a hard problem to solve. Any help is greatly
appreciated.
There are many possible solutions, and it is not obvious how to assign a
score to each so that we can choose a single best one. Also, once a
scoring system is designed, finding the best-scoring solution may be
computationally expensive. So some kind of heuristic is probably needed.
In the example you give, the best matching at the front (Foo) corresponds
to the best matching at the rear (Bar). Is that likely to be a common
occurrence in your data, or was it just a coincidence?
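
As one illustration of what such a heuristic might look like, here is a minimal greedy sketch in Python. It is only one of many possible approaches and makes no attempt to find a "best" answer. The names compress and common_suffix are my own. It assumes the words contain no curlies or commas, and it only ever emits one set of curlies per pattern: each word is grouped with every remaining word that shares its longest common prefix, and then whatever suffix the group has in common is factored out.

```python
import os


def common_suffix(strings):
    """Longest string that every element of `strings` ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def compress(words):
    """Greedily collapse words into patterns with at most one set of curlies."""
    remaining = list(dict.fromkeys(words))   # drop duplicates, keep input order
    patterns = []
    while remaining:
        word = remaining[0]
        # Longest prefix that `word` shares with any other remaining word.
        prefix = max(
            (os.path.commonprefix([word, other]) for other in remaining[1:]),
            key=len,
            default="",
        )
        if not prefix:
            patterns.append(word)            # nothing to share: emit as-is
            remaining.pop(0)
            continue
        group = [w for w in remaining if w.startswith(prefix)]
        remaining = [w for w in remaining if not w.startswith(prefix)]
        tails = [w[len(prefix):] for w in group]
        suffix = common_suffix(tails)        # computed on tails, so no overlap
        middles = [t[:len(t) - len(suffix)] for t in tails]
        patterns.append(prefix + "{" + ",".join(middles) + "}" + suffix)
    return patterns


words = ["FooZZZBar", "FooYBar", "FooXXBar", "Baz11", "Baz222", "Nop"]
print("\n".join(compress(words)))
```

On the example list this prints the three lines you asked for. It works there because the best front matching and the best rear matching agree. On data where they don't, a prefix-first greedy like this can make poor choices, which is why the question above matters.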
Does Regexp::List come up with a regex which matches all of the given words
*and nothing else*? The docs didn't seem to address that issue.
Anyway, if your goal is to condense, say, a large directory listing down to
a handful of patterns that a human could easily discern, I'm not sure that
something optimized for a regex engine would do a good job. (Although
looking at the techniques used by it could certainly be informative.)
If this is for human consumption, I would have a preference for patterns
in which the curlies occur at natural boundaries, such as transitions
from letter to number or number to letter or punctuation to
non-punctuation, etc.
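
One way to get that behaviour, sketched in Python under my own assumptions: split each name into runs of letters, digits, and everything else, then look for the common front and rear in whole runs rather than in single characters. The names tokenize and compress_on_boundaries are made up for this sketch, "letter" here means ASCII letters only, and it handles just one group of names and one set of curlies.

```python
import re

# A token is a maximal run of ASCII letters, of digits, or of anything else,
# so token boundaries fall exactly on the "natural" transitions.
TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]+")


def tokenize(name):
    return TOKEN.findall(name)


def compress_on_boundaries(names):
    """Collapse a non-empty list of names into one pattern, splitting only
    at letter/digit/punctuation transitions."""
    if len(set(names)) == 1:
        return names[0]
    token_lists = [tokenize(n) for n in names]
    shortest = min(len(t) for t in token_lists)

    # Count leading tokens shared by every name.
    head = 0
    while head < shortest and len({t[head] for t in token_lists}) == 1:
        head += 1
    # Count trailing tokens shared by every name, without overlapping the head.
    tail = 0
    while tail < shortest - head and len({t[-1 - tail] for t in token_lists}) == 1:
        tail += 1

    first = token_lists[0]
    prefix = "".join(first[:head])
    suffix = "".join(first[len(first) - tail:])
    middles = ["".join(t[head:len(t) - tail]) for t in token_lists]
    return prefix + "{" + ",".join(middles) + "}" + suffix


print(compress_on_boundaries(["log10.txt", "log12.txt", "log7.txt"]))
```

A character-by-character comparison of just log10.txt and log12.txt would give log1{0,2}.txt, which splits the number in the middle. Comparing whole runs gives log{10,12}.txt instead. Note that names made up only of letters, like FooZZZBar, are a single run under this scheme, so a finer rule (for example, splitting at lower-to-upper case transitions) would be needed to do anything useful with them.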
As someone who frequently looks at very long directory listings of
computer-generated file names, this is something I've often thought about,
but never actually attempted.
Xho