Inverted RegEx on list of numbers


S

Stephan Mann

Hi!

I'm relatively new to Perl and I'm not a RegEx guru either. I'd like to
understand, why the following code does not work as I expect.
Obviously, I'm oblivious to some detail of how the RegEx engine works.

my @foo = ('one', 'two', 'three');
print join " ", grep { /[^(two)]/ } @foo, "\n";

my @bar = ('123:44', '123:45', '123:46');
print join " ", grep { /[^(123:45)]/ } @bar, "\n";
print join " ", grep { !/123:45/ } @bar, "\n";

Output:

one three
123:46
123:44 123:46

I'm completely lost as to why the second grep doesn't work like the first
one. The colon doesn't seem to be the problem, since the behavior stays
the same without it. How or why are numbers handled differently?!
The third grep works, but it forces me to handle this in a separate
RegEx which I can't extend if a want to do more with one RegEx.

tia, stephan


PS: This is my first post to a non-test group with slrn. Please let me
know if there is anything wrong with my post.
 
Ad

Advertisements

D

Damian Lukowski

Stephan said:
Hi!

I'm relatively new to Perl and I'm not a RegEx guru either. I'd like to
understand, why the following code does not work as I expect.
Obviously, I'm oblivious to some detail of how the RegEx engine works.

my @foo = ('one', 'two', 'three');
print join " ", grep { /[^(two)]/ } @foo, "\n";

my @bar = ('123:44', '123:45', '123:46');
print join " ", grep { /[^(123:45)]/ } @bar, "\n";
print join " ", grep { !/123:45/ } @bar, "\n";

[^(two)] is a character class, meaning a character which is neither '('
nor ')', nor 't', nor 'w', nor 'o', nor ')'.

"one" matches by /[^(two)]/, because there is a character 'n', which is
matched by [^(two)]. The same holds for 'h' in "three".

On the other hand, there is no character in "123:44" which is not an
'1', not an '2', and so on.
 
A

Achim Peters

Stephan said:
I'm relatively new to Perl and I'm not a RegEx guru either. I'd like to
understand, why the following code does not work as I expect.
Obviously, I'm oblivious to some detail of how the RegEx engine works.

You have a mere RE "problem". There is nothing perl specific in your
observations. I recommend reading more about RegExps. (Since your
problem is not perl related, any documentation about REs will do)
my @foo = ('one', 'two', 'three');
print join " ", grep { /[^(two)]/ } @foo, "\n";

my @bar = ('123:44', '123:45', '123:46');
print join " ", grep { /[^(123:45)]/ } @bar, "\n";
print join " ", grep { !/123:45/ } @bar, "\n";

Output:

one three
123:46
123:44 123:46

I'm completely lost as to why the second grep doesn't work like the first
one.

The second grep _does_ work like the first one. Both just not the way
you expect them to work. ;-)
How or why are numbers handled differently?!

Numbers per se are _not_ handled differently.

"[]" matches a single (one!) character, no matter how many characters
you list in between the "[" and the "]". A character of the expression
to be "grepped" matches the "[]" if and only if it is listed within the
"[]" (ranges and classes of matching characters are possible), given the
first character after the "[" is not a "^", or respectively matches if
and only if it is *not* listed, given the first character after the "["
is a "^".

So, it doesn't matter, whether you write
grep { /[^(two)]/ }
or
grep { /[^otw()]/ }

In both cases the expression will match _a_ _single_ _character_ which
is neither "o" nor "t" nor "w" nor "(" nor ")". Parentheses have no
special meaning within [].

'one' has a character which fulfills this expression: "n" ('one' has
even one more character, which does, but since your regex consisted only
of the [] the list item is already matched by the one matching character.
'two' does not have any such character (one that is not a "t" nor a "w"
nor an "o" nor a "(" nor a ")"
'three' has the "h" --> match (and furthermore the "r" and the "e".

I hope, you see by now, why
print join " ", grep { /[^(123:45)]/ } @bar, "\n";
worked the way it worked.

Bye
Achim
 
S

Stephan Mann

Stephan said:
my @foo = ('one', 'two', 'three');
print join " ", grep { /[^(two)]/ } @foo, "\n";

my @bar = ('123:44', '123:45', '123:46');
print join " ", grep { /[^(123:45)]/ } @bar, "\n";
print join " ", grep { !/123:45/ } @bar, "\n";

[^(two)] is a character class, meaning a character which is neither '('
nor ')', nor 't', nor 'w', nor 'o', nor ')'.

"one" matches by /[^(two)]/, because there is a character 'n', which is
matched by [^(two)]. The same holds for 'h' in "three".

On the other hand, there is no character in "123:44" which is not an
'1', not an '2', and so on.

Thank you very much, Damian. I was under the impression that the ()
inside the character class would be recognized and concatenate the
characters to a string (If I ever find this website again... Grr!).

After yet another hour of reading it became clear to me that there is no
way to write RegEx like "Match anything but this string" other than with
look-ahead/-behind. So my solution for now is

print join(" ", grep { /^(?!123:45)/ } @bar), "\n";

Hopefully this now works as intended, not only by accident.

greetings, stephan
 
S

Stephan Mann

You have a mere RE "problem". There is nothing perl specific in your
observations. I recommend reading more about RegExps. (Since your
problem is not perl related, any documentation about REs will do)

The !/../ solution is Perl specific, isn't it? But of course you are
right, I got the RegEx completely wrong. Thank you and Abigail for the
detailed explanation.

greetings, stephan
 
J

John W. Krahn

Stephan said:
Stephan said:
my @foo = ('one', 'two', 'three');
print join " ", grep { /[^(two)]/ } @foo, "\n";

my @bar = ('123:44', '123:45', '123:46');
print join " ", grep { /[^(123:45)]/ } @bar, "\n";
print join " ", grep { !/123:45/ } @bar, "\n";
[^(two)] is a character class, meaning a character which is neither '('
nor ')', nor 't', nor 'w', nor 'o', nor ')'.

"one" matches by /[^(two)]/, because there is a character 'n', which is
matched by [^(two)]. The same holds for 'h' in "three".

On the other hand, there is no character in "123:44" which is not an
'1', not an '2', and so on.

Thank you very much, Damian. I was under the impression that the ()
inside the character class would be recognized and concatenate the
characters to a string (If I ever find this website again... Grr!).

After yet another hour of reading it became clear to me that there is no
way to write RegEx like "Match anything but this string" other than with
look-ahead/-behind. So my solution for now is

print join(" ", grep { /^(?!123:45)/ } @bar), "\n";

Hopefully this now works as intended, not only by accident.

It looks like from your example you could also do:

print join( " ", grep !/^123:45$/, @bar ), "\n";

Or:

print join( " ", grep $_ ne '123:45', @bar ), "\n";



John
 
Ad

Advertisements

D

Damian Lukowski

Stephan said:
After yet another hour of reading it became clear to me that there is no
way to write RegEx like "Match anything but this string" other than with
look-ahead/-behind.

In case it is an "academic" regular expression (as defined in computer
science) with no backreferences and zero-width assertions, there is
always another regex which matches exactly the opposite, as there is an
algorithm for giving you the complementary regex. If I'm not mistaken
the complementary regex to /two/ is something like
/^(?:[^t]|t+[^tw]|(?:t+w)+([^to]|t+[^tw]))*(?:t+w)+o/.

Damian
 
D

Damian Lukowski

Damian said:
If I'm not mistaken
the complementary regex to /two/ is something like
/^(?:[^t]|t+[^tw]|(?:t+w)+([^to]|t+[^tw]))*(?:t+w)+o/.

Damian

Well, I am mistaken here, as I forgot an important step. :)
I won't correct it, because it will be more complicated as it is already. :)
 
S

Stephan Mann

Stephan said:
After yet another hour of reading it became clear to me that there is no
way to write RegEx like "Match anything but this string" other than with
look-ahead/-behind.

In case it is an "academic" regular expression (as defined in computer
science) with no backreferences and zero-width assertions, there is
always another regex which matches exactly the opposite, as there is an
algorithm for giving you the complementary regex. If I'm not mistaken
the complementary regex to /two/ is something like
/^(?:[^t]|t+[^tw]|(?:t+w)+([^to]|t+[^tw]))*(?:t+w)+o/.

Let me rephrase that: There is no way to write _a readable_ RegEx... ;-)

Actually, I thought about this (although I would have failed writing it
down) but this simply isn't applicable for strings longer than five
characters and also doesn't work if your search string is variable.

Thankfully, my problem was very "practical" :)

thanks again for your efforts,
stephan
 
D

David Combs

Damian said:
If I'm not mistaken
the complementary regex to /two/ is something like
/^(?:[^t]|t+[^tw]|(?:t+w)+([^to]|t+[^tw]))*(?:t+w)+o/.

Damian

Well, I am mistaken here, as I forgot an important step. :)
I won't correct it, because it will be more complicated as it is already. :)

C'mon, give it a shot.

After all, you did suck us into this. :=)

And as something to add to a regexp-tutorial, describe
how your existing line was supposed to work and accomplish --
and then what was wrong with it.

As you attach your fix, sequential thinking to it, with
the pros and cons you went through on the way there.


Of course no one would have time to do such a thing,
but maybe the idea of that kind of annotation will
stick in someone's head, and actually be done from
time to time.


David
 
D

Damian Lukowski

David said:
C'mon, give it a shot.

After all, you did suck us into this. :=)

And as something to add to a regexp-tutorial, describe
how your existing line was supposed to work and accomplish --
and then what was wrong with it.


Well, okay.

The rough approach to invert a regular expression is this:

- Convert the regex to an equivalent epsilon-NFA.
- Eliminate epsilon transitions.
- Convert NFA to equivalent DFA.
- Invert final state(s) to nonfinal state(s) and vice versa.
- Convert the inverted DFA back into a regular expression.

Above, I forgot the fourth step and converted a DFA into a regular
expression without inverting any states. Thus, the former /two/ should
be equivalent to the bloated one.

Damian
 
Ad

Advertisements

P

paul

Well, okay.

The rough approach to invert a regular expression is this:

- Convert the regex to an equivalent epsilon-NFA.
- Eliminate epsilon transitions.
- Convert NFA to equivalent DFA.
- Invert final state(s) to nonfinal state(s) and vice versa.
- Convert the inverted DFA back into a regular expression.

Above, I forgot the fourth step and converted a DFA into a regular
expression without inverting any states. Thus, the former /two/ should
be equivalent to the bloated one.

Damian

Hello.
Is there any library or tool which can do it for us.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top