# Construction of a non-regexable subset of the set of all strings

Discussion in 'Perl Misc' started by Dominic van der Zypen, Jun 23, 2006.

1. ### Dominic van der ZypenGuest

Hello,

I have a question which I am afraid might seem "overly scientific" (not
to say nutty), but which I find is quite natural after all.

If we look at the set S of all finite strings, then each regex
represents some subset of S. For instance, h(a|e)ndel is the set
consisting of handel and hendel; and the regex .+ represents the set
of all non-empty strings not containing \n.

I wondered whether there is a "non-regexable" subset N of S (i.e. a
subset N for which there is no regex representing N) and came up with
a (non-constructive) positive answer from set theory: there are
uncountably many subsets of S, but there are only countably many
regular expressions, so there are non-regexable subsets.

What I would like to know is whether one can construct a subset N of S
which is non-regexable (and whether it is hard to prove that N has the
desired property).

Many thanks in advance and best wishes, Dominic

Dominic van der Zypen, Jun 23, 2006

2. ### John W. KrahnGuest

Dominic van der Zypen wrote:
>
> I have a question which I am afraid might seem "overly scientific" (not
> to say nutty), but which I find is quite natural after all.
>
> If we look at the set S of all finite strings, then each regex
> represents some subset of S. For instance, h(a|e)ndel is the set
> consisting of handel and hendel; and the regex .+ represents the set
> of all non-empty strings not containing \n.
>
> I wondered whether there is a "non-regexable" subset N of S (i.e. a
> subset N for which there is no regex representing N) and came up with
> a (non-constructive) positive answer from set theory: there are
> uncountably many subsets of S, but there are only countably many
> regular expressions, so there are non-regexable subsets.
>
> What I would like to know is whether one can construct a subset N of S
> which is non-regexable (and whether it is hard to prove that N has the
> desired property).

Are you talking about regular expressions in general or only Perl's regular
expressions? If this is a generic regular expression question you may want to

John
--
use Perl;
program
fulfillment

John W. Krahn, Jun 23, 2006

3. ### David SquireGuest

Glenn Jackman wrote:
> At 2006-06-23 08:45AM, Dominic van der Zypen <> wrote:
>> Hello,
>>
>> I have a question which I am afraid might seem "overly scientific" (not
>> to say nutty), but which I find is quite natural after all.
>>
>> If we look at the set S of all finite strings, then each regex
>> represents some subset of S. For instance, h(a|e)ndel is the set
>> consisting of handel and hendel; and the regex .+ represents the set
>> of all non-empty strings not containing \n.
>>
>> I wondered whether there is a "non-regexable" subset N of S (i.e. a
>> subset N for which there is no regex representing N) and came up with
>> a (non-constructive) positive answer from set theory: there are
>> uncountably many subsets of S, but there are only countably many
>> regular expressions, so there are non-regexable subsets.
>>
>> What I would like to know is whether one can construct a subset N of S
>> which is non-regexable (and whether it is hard to prove that N has the
>> desired property).

>
> It seems to me that, for every string x in S, one can construct at least
> one regular expression to match it, i.e. ^x\$
>

Yes, but the set of all subsets of S is not the same as the set of
single strings in S (i.e. S).

DS

David Squire, Jun 23, 2006
4. ### Ted ZlatanovGuest

On 23 Jun 2006, wrote:

> I wondered whether there is a "non-regexable" subset N of S (i.e. a
> subset N for which there is no regex representing N) and came up with
> a (non-constructive) positive answer from set theory: there are
> uncountably many subsets of S, but there are only countably many
> regular expressions, so there are non-regexable subsets.

> What I would like to know is whether one can construct a subset N of S
> which is non-regexable (and whether it is hard to prove that N has the
> desired property).

(S: set of all finite strings)

I assume by "regexable" you mean "uniquely regexable", meaning that
the regular expression would not match S-N.

Any finite subset N of S would be regexable, since you can construct a
regular expression that does an OR of all the finite strings in N.

Some infinite subsets (e.g. all the strings that begin with "a") could
be regexable, but generally you can't uniquely match a random infinite
collection of strings with a regular expression, since there's nothing
for the regular expression to "latch" onto. I'm sure there's a more
rigorous analysis, perhaps using the Huffman distance, but I would say
that if a set is infinite and constructed without a specific set of
rules, it's not possible in finite time to *build* a regular
expression to match only that set.

Ted

Ted Zlatanov, Jun 23, 2006