Inverted RegEx on list of numbers

Stephan Mann · Feb 13, 2008

Hi!

I'm relatively new to Perl and I'm not a RegEx guru either. I'd like to
understand, why the following code does not work as I expect.
Obviously, I'm oblivious to some detail of how the RegEx engine works.

my @foo = ('one', 'two', 'three');
print join " ", grep { /[^(two)]/ } @foo, "\n";

my @bar = ('123:44', '123:45', '123:46');
print join " ", grep { /[^(123:45)]/ } @bar, "\n";
print join " ", grep { !/123:45/ } @bar, "\n";

Output:

one three
123:46
123:44 123:46

I'm completely lost as to why the second grep doesn't work like the first
one. The colon doesn't seem to be the problem, since the behavior stays
the same without it. How or why are numbers handled differently?!
The third grep works, but it forces me to handle this in a separate
RegEx which I can't extend if I want to do more with one RegEx.

tia, stephan

PS: This is my first post to a non-test group with slrn. Please let me
know if there is anything wrong with my post.

Damian Lukowski · Feb 13, 2008

Stephan said:
Hi!

I'm relatively new to Perl and I'm not a RegEx guru either. I'd like to
understand, why the following code does not work as I expect.
Obviously, I'm oblivious to some detail of how the RegEx engine works.

my @foo = ('one', 'two', 'three');
print join " ", grep { /[^(two)]/ } @foo, "\n";

my @bar = ('123:44', '123:45', '123:46');
print join " ", grep { /[^(123:45)]/ } @bar, "\n";
print join " ", grep { !/123:45/ } @bar, "\n";

[^(two)] is a negated character class. It matches a single character
which is none of '(', 't', 'w', 'o' and ')'.

"one" is matched by /[^(two)]/ because it contains the character 'n',
which is matched by [^(two)]. The same holds for 'h' in "three".

On the other hand, there is no character in "123:44" which is not a
'1', a '2', a '3', a ':', a '4' or a '5', so [^(123:45)] cannot match it.
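
To make this visible per string, here is a minimal sketch that lists the characters each negated class actually accepts. The %outside hash and the reporting loop are illustrative names and not part of the original code. A string passes the grep as soon as its list is non-empty.

```perl
use strict;
use warnings;

my @foo = ('one', 'two', 'three');
my @bar = ('123:44', '123:45', '123:46');

# Map each string to the characters that its negated class matches.
# %outside is a hypothetical helper used only for this illustration.
my %outside;
$outside{$_} = [ /([^(two)])/g ]    for @foo;
$outside{$_} = [ /([^(123:45)])/g ] for @bar;

for my $item (@foo, @bar) {
    my @chars = @{ $outside{$item} };
    printf "%-7s %s\n", $item, @chars ? "matches via: @chars" : "no match";
}
```

For 'one' the list is "n e", for '123:46' it is just "6", and for 'two', '123:44' and '123:45' it is empty, which is exactly why those three are filtered out.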

Achim Peters · Feb 13, 2008

Stephan said:
I'm relatively new to Perl and I'm not a RegEx guru either. I'd like to
understand, why the following code does not work as I expect.
Obviously, I'm oblivious to some detail of how the RegEx engine works.

You have a mere RE "problem". There is nothing Perl-specific in your
observations. I recommend reading more about RegExps. (Since your
problem is not Perl-related, any documentation about REs will do.)

my @foo = ('one', 'two', 'three');
print join " ", grep { /[^(two)]/ } @foo, "\n";

my @bar = ('123:44', '123:45', '123:46');
print join " ", grep { /[^(123:45)]/ } @bar, "\n";
print join " ", grep { !/123:45/ } @bar, "\n";

Output:

one three
123:46
123:44 123:46

I'm completely lost as to why the second grep doesn't work like the first
one.

The second grep _does_ work like the first one. Both just not the way
you expect them to work. ;-)

How or why are numbers handled differently?!

Numbers per se are _not_ handled differently.

A "[]" matches a single (one!) character, no matter how many characters
you list between the "[" and the "]". If the first character after the
"[" is not a "^", a character of the string being "grepped" matches the
"[]" if and only if it is listed within the "[]" (ranges and classes of
characters are possible). If the first character after the "[" is a "^",
a character matches if and only if it is *not* listed.

So, it doesn't matter, whether you write
grep { /[^(two)]/ }
or
grep { /[^otw()]/ }

In both cases the expression will match _a_ _single_ _character_ which
is neither "o" nor "t" nor "w" nor "(" nor ")". Parentheses have no
special meaning within [].
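
A quick way to convince yourself of this is to run both spellings side by side. This is only a sketch: '(two)' and 'wot' are extra test strings added for the illustration, and the array names are made up.

```perl
use strict;
use warnings;

# '(two)' and 'wot' are hypothetical extra inputs, not from the question.
my @foo = ('one', 'two', 'three', '(two)', 'wot');

my @grouped   = grep { /[^(two)]/ } @foo;   # class as written in the question
my @reordered = grep { /[^otw()]/ } @foo;   # same five characters, reordered

print "grouped:   @grouped\n";     # one three
print "reordered: @reordered\n";   # one three
```

'(two)' is rejected as well, because inside the class the parentheses are just two more listed characters.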

'one' has a character which fulfills this expression: "n". ('one' even
has one more such character, "e", but since your regex consists only of
the [], the list item is already matched by the first matching character.)
'two' does not have any such character (one that is neither a "t" nor a
"w" nor an "o" nor a "(" nor a ")").
'three' has the "h" --> match (and furthermore the "r" and the "e").

I hope, you see by now, why
print join " ", grep { /[^(123:45)]/ } @bar, "\n";
worked the way it worked.

Bye
Achim

Stephan Mann · Feb 13, 2008

Damian said:
Stephan said:

my @foo = ('one', 'two', 'three');
print join " ", grep { /[^(two)]/ } @foo, "\n";

my @bar = ('123:44', '123:45', '123:46');
print join " ", grep { /[^(123:45)]/ } @bar, "\n";
print join " ", grep { !/123:45/ } @bar, "\n";

[^(two)] is a negated character class. It matches a single character
which is none of '(', 't', 'w', 'o' and ')'.

"one" is matched by /[^(two)]/ because it contains the character 'n',
which is matched by [^(two)]. The same holds for 'h' in "three".

On the other hand, there is no character in "123:44" which is not a
'1', a '2', a '3', a ':', a '4' or a '5', so [^(123:45)] cannot match it.

Thank you very much, Damian. I was under the impression that the ()
inside the character class would be recognized and concatenate the
characters to a string (If I ever find this website again... Grr!).

After yet another hour of reading it became clear to me that there is no
way to write RegEx like "Match anything but this string" other than with
look-ahead/-behind. So my solution for now is

print join(" ", grep { /^(?!123:45)/ } @bar), "\n";

Hopefully this now works as intended, not only by accident.
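
For what it's worth, the anchored look-ahead means "does not start with 123:45", which is slightly different from "is not exactly 123:45" or "does not contain 123:45". A small sketch of the three variants, assuming two extra test strings ('123:456' and 'x123:45') that are not in the original list:

```perl
use strict;
use warnings;

# '123:456' and 'x123:45' are hypothetical inputs added to show the difference.
my @bar = ('123:44', '123:45', '123:456', 'x123:45');

# Reject strings that START with 123:45.
my @not_prefixed  = grep { /^(?!123:45)/ } @bar;     # 123:44 x123:45

# Reject only the string that IS 123:45 (\z anchors at the very end).
my @not_exact     = grep { /^(?!123:45\z)/ } @bar;   # 123:44 123:456 x123:45

# Reject strings that CONTAIN 123:45 anywhere.
my @not_contained = grep { /^(?!.*123:45)/ } @bar;   # 123:44

print "not prefixed:  @not_prefixed\n";
print "not exact:     @not_exact\n";
print "not contained: @not_contained\n";
```

On the original three-element list all three variants give the same result, so which one is "right" depends on what other data may show up.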

greetings, stephan

Stephan Mann · Feb 13, 2008

Achim said:
You have a mere RE "problem". There is nothing Perl-specific in your
observations. I recommend reading more about RegExps. (Since your
problem is not Perl-related, any documentation about REs will do.)

The !/../ solution is Perl-specific, isn't it? But of course you are
right, I got the RegEx completely wrong. Thank you and Abigail for the
detailed explanation.

greetings, stephan

John W. Krahn · Feb 13, 2008

Stephan said:
Damian said:
Stephan said:

my @foo = ('one', 'two', 'three');
print join " ", grep { /[^(two)]/ } @foo, "\n";

my @bar = ('123:44', '123:45', '123:46');
print join " ", grep { /[^(123:45)]/ } @bar, "\n";
print join " ", grep { !/123:45/ } @bar, "\n";

[^(two)] is a negated character class. It matches a single character
which is none of '(', 't', 'w', 'o' and ')'.

"one" is matched by /[^(two)]/ because it contains the character 'n',
which is matched by [^(two)]. The same holds for 'h' in "three".

On the other hand, there is no character in "123:44" which is not a
'1', a '2', a '3', a ':', a '4' or a '5', so [^(123:45)] cannot match it.

Thank you very much, Damian. I was under the impression that the ()
inside the character class would be recognized and concatenate the
characters to a string (If I ever find this website again... Grr!).

After yet another hour of reading it became clear to me that there is no
way to write RegEx like "Match anything but this string" other than with
look-ahead/-behind. So my solution for now is

print join(" ", grep { /^(?!123:45)/ } @bar), "\n";

Hopefully this now works as intended, not only by accident.

From your example, it looks like you could also do:

print join( " ", grep !/^123:45$/, @bar ), "\n";

Or:

print join( " ", grep $_ ne '123:45', @bar ), "\n";
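
If the string to exclude lives in a variable, the same two approaches still work, provided the regex version quotes the variable with \Q...\E. A minimal sketch, where $target, $dotted and the extra item '123x45' are made up for illustration:

```perl
use strict;
use warnings;

my @bar    = ('123:44', '123:45', '123:46', '123x45');
my $target = '123:45';   # hypothetical variable search string

my @via_regex = grep { !/^\Q$target\E\z/ } @bar;   # 123:44 123:46 123x45
my @via_ne    = grep { $_ ne $target } @bar;       # 123:44 123:46 123x45

# Without \Q...\E, metacharacters in the variable are treated as regex syntax.
my $dotted   = '123.45';
my @unquoted = grep { !/^$dotted\z/ } @bar;        # '.' matches ':' and 'x' too
my @quoted   = grep { !/^\Q$dotted\E\z/ } @bar;    # '.' is a literal dot

print "via regex: @via_regex\n";
print "via ne:    @via_ne\n";
print "unquoted:  @unquoted\n";   # 123:44 123:46
print "quoted:    @quoted\n";     # 123:44 123:45 123:46 123x45
```

For a plain "is not this exact string" test the ne version is probably the easier one to read, since no quoting is needed at all.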

John

Damian Lukowski · Feb 13, 2008

Stephan said:
After yet another hour of reading it became clear to me that there is no
way to write RegEx like "Match anything but this string" other than with
look-ahead/-behind.

In case it is an "academic" regular expression (as defined in computer
science) with no backreferences and zero-width assertions, there is
always another regex which matches exactly the opposite, as there is an
algorithm for giving you the complementary regex. If I'm not mistaken
the complementary regex to /two/ is something like
/^(?:[^t]|t+[^tw]|(?:t+w)+([^to]|t+[^tw]))*(?:t+w)+o/.

Damian

Damian Lukowski · Feb 13, 2008

Damian said:
If I'm not mistaken
the complementary regex to /two/ is something like
/^(?:[^t]|t+[^tw]|(?:t+w)+([^to]|t+[^tw]))*(?:t+w)+o/.

Damian

Well, I am mistaken here, as I forgot an important step.

I won't correct it, because it would be even more complicated than it already is.

Stephan Mann · Feb 13, 2008

Damian said:
Stephan said:

After yet another hour of reading it became clear to me that there is no
way to write RegEx like "Match anything but this string" other than with
look-ahead/-behind.

In case it is an "academic" regular expression (as defined in computer
science) with no backreferences and zero-width assertions, there is
always another regex which matches exactly the opposite, as there is an
algorithm for giving you the complementary regex. If I'm not mistaken
the complementary regex to /two/ is something like
/^(?:[^t]|t+[^tw]|(?:t+w)+([^to]|t+[^tw]))*(?:t+w)+o/.

Let me rephrase that: There is no way to write _a readable_ RegEx... ;-)

Actually, I thought about this (although I would have failed writing it
down) but this simply isn't applicable for strings longer than five
characters and also doesn't work if your search string is variable.

Thankfully, my problem was very "practical".

thanks again for your efforts,
stephan

David Combs · Mar 8, 2008

Damian said:
Damian said:

If I'm not mistaken
the complementary regex to /two/ is something like
/^(?:[^t]|t+[^tw]|(?:t+w)+([^to]|t+[^tw]))*(?:t+w)+o/.

Damian

Well, I am mistaken here, as I forgot an important step.
I won't correct it, because it would be even more complicated than it already is.

C'mon, give it a shot.

After all, you did suck us into this. :=)

And as something to add to a regexp tutorial, describe how your
existing line was supposed to work and what it was meant to
accomplish, and then what was wrong with it.

Then attach your fix, along with the sequential thinking behind it
and the pros and cons you went through on the way there.

Of course no one would have time to do such a thing,
but maybe the idea of that kind of annotation will
stick in someone's head, and actually be done from
time to time.

David

Damian Lukowski · Mar 8, 2008

David said:
C'mon, give it a shot.

After all, you did suck us into this. :=)

And as something to add to a regexp tutorial, describe how your
existing line was supposed to work and what it was meant to
accomplish, and then what was wrong with it.

Well, okay.

The rough approach to invert a regular expression is this:

- Convert the regex to an equivalent epsilon-NFA.
- Eliminate epsilon transitions.
- Convert NFA to equivalent DFA.
- Invert final state(s) to nonfinal state(s) and vice versa.
- Convert the inverted DFA back into a regular expression.

Above, I forgot the fourth step and converted a DFA into a regular
expression without inverting any states. Thus, the former /two/ should
be equivalent to the bloated one.
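
The last two steps can be sketched in Perl for the /two/ example. Everything here is an illustration under stated assumptions: the DFA is written down by hand instead of being generated from an NFA, the names %delta, final_state, $lacks_two and $bloated are made up, and the complement pattern was derived by hand via state elimination. The loop at the end compares each pattern against a plain index() test as a sanity check.

```perl
use strict;
use warnings;

# Hand-written DFA for "the string contains 'two'". State 3 is the only
# accepting state; characters without an entry send states 0..2 back to 0.
my %delta = (
    0 => { t => 1 },
    1 => { t => 1, w => 2 },
    2 => { t => 1, o => 3 },
);

sub final_state {
    my ($string) = @_;
    my $state = 0;
    for my $char (split //, $string) {
        last if $state == 3;    # state 3 is absorbing
        $state = exists $delta{$state}{$char} ? $delta{$state}{$char} : 0;
    }
    return $state;
}

# Step 4: swapping final and nonfinal states turns "contains" into "lacks".
sub dfa_contains_two { return final_state($_[0]) == 3 }
sub dfa_lacks_two    { return final_state($_[0]) != 3 }

# Step 5: a hand-derived regex for the inverted DFA (assumption of this sketch).
my $lacks_two = qr/\A(?:[^t]|t(?:t|wt)*(?:[^tw]|w[^to]))*(?:t(?:t|wt)*w?)?\z/;

# The pattern quoted earlier in the thread, expected to behave like /two/.
my $bloated = qr/^(?:[^t]|t+[^tw]|(?:t+w)+([^to]|t+[^tw]))*(?:t+w)+o/;

# Build every string of length 0..6 over a small test alphabet.
my @strings  = ('');
my @frontier = ('');
for my $length (1 .. 6) {
    my @next;
    for my $prefix (@frontier) {
        push @next, $prefix . $_ for qw(t w o x);
    }
    push @strings, @next;
    @frontier = @next;
}

# Count disagreements with a plain substring test.
my $mismatches = 0;
for my $string (@strings) {
    my $has_two = index($string, 'two') >= 0 ? 1 : 0;
    $mismatches++ if $has_two != (dfa_contains_two($string) ? 1 : 0);
    $mismatches++ if $has_two == (dfa_lacks_two($string) ? 1 : 0);
    $mismatches++ if $has_two == ($string =~ $lacks_two ? 1 : 0);
    $mismatches++ if $has_two != ($string =~ $bloated ? 1 : 0);
}
print "checked ", scalar(@strings), " strings, $mismatches mismatches\n";
```

Even for a three-character literal the inverted pattern is hard to read, so in practice !/two/ or a negative look-ahead remains the saner choice.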

Damian

paul · Mar 12, 2008

Damian said:
Well, okay.

The rough approach to invert a regular expression is this:

- Convert the regex to an equivalent epsilon-NFA.
- Eliminate epsilon transitions.
- Convert NFA to equivalent DFA.
- Invert final state(s) to nonfinal state(s) and vice versa.
- Convert the inverted DFA back into a regular expression.

Above, I forgot the fourth step and converted a DFA into a regular
expression without inverting any states. Thus, the former /two/ should
be equivalent to the bloated one.

Damian

Hello.
Is there any library or tool which can do it for us?
