Boolean Regexp with perlre

K

Kevin Crosbie

Hi,

I'm not sure if this is appropriate for this group, it's related to
perlre rather than perl, if so, apologies and I'd appreciate a pointer
to the right group.

I'm trying to find out if this is possible before I embark upon
implementing a solution to my problem.

I have a string with comma separated tags:
"a, b, c, d, e, f"

It's rather easy to write something to express a boolean OR:
a OR b OR c = (^|,(\s)+)(a|b|c)[,$]

What I would like to know is if there is a way to express AND or NOT:
1. (a OR b) AND c
2. (a OR b) AND NOT c
3. (a OR b) AND NOT (c OR d)

I imagine there is no nice way to do this without doing something like
writing out your AND clause before and after whatever OR clause you are
using, which would become really messy for more complicated expressions,
but perhaps someone knows of some way to do this.

Regards,

Kevin
 
B

Brian McCauley

Kevin said:
I'm not sure if this is appropriate for this group, it's related to
perlre rather than perl, if so, apologies and I'd appreciate a pointer
to the right group.

I do not know how much of Perl REs are implemented by perlre so the
solutions I offer may not apply.
I'm trying to find out if this is possible before I embark upon
implementing a solution to my problem.

I have a string with comma separated tags:
"a, b, c, d, e, f"

It's rather easy to write something to express a boolean OR:
a OR b OR c = (^|,(\s)+)(a|b|c)
What I would like to know is if there is a way to express AND or NOT:
1. (a OR b) AND c

/^(?=.*(^|,(\s)+)(c|d)(,|$)).*(^|,(\s)+)(a|b)(,|$)/

Note: if you want to trade efficiency for readbility you can make all
the capturing (...) into non-captureing (?:...)

Note: I've assumed your data contains no characters that don't match
the period regex.
3. (a OR b) AND NOT (c OR d)
/^(?!.*(^|,(\s)+)(c|d)).*(^|,(\s)+)(a|b)(,|$)/

I imagine there is no nice way to do this without doing something like
writing out your AND clause before and after whatever OR clause you are
using, which would become really messy for more complicated expressions,
but perhaps someone knows of some way to do this.

But REs are really the wrong tool for the job. If this were a real
Perl question I'd say change your API to take a CODE ref rather than a
regex.
 
J

John W. Krahn

Kevin said:
I'm not sure if this is appropriate for this group, it's related to
perlre rather than perl, if so, apologies and I'd appreciate a pointer
to the right group.

I'm trying to find out if this is possible before I embark upon
implementing a solution to my problem.

I have a string with comma separated tags:
"a, b, c, d, e, f"

It's rather easy to write something to express a boolean OR:
a OR b OR c = (^|,(\s)+)(a|b|c)[,$]

What I would like to know is if there is a way to express AND or NOT:
1. (a OR b) AND c

/a|b/ && /c/

2. (a OR b) AND NOT c

/a|b/ && !/c/

3. (a OR b) AND NOT (c OR d)

/a|b/ && !/c|d/




John
 
J

jl_post

Kevin said:
I have a string with comma separated tags:
"a, b, c, d, e, f"

It's rather easy to write something to express a boolean OR:
a OR b OR c = (^|,(\s)+)(a|b|c)[,$]

Um, I don't think the "[,$]" is doing what you think it should be
doing (in fact, I don't think that'll even compile). What you should
use in its place is "(,|$)".
What I would like to know is if there is a way to express AND or NOT:
1. (a OR b) AND c
2. (a OR b) AND NOT c
3. (a OR b) AND NOT (c OR d)

I imagine there is no nice way to do this without doing something like
writing out your AND clause before and after whatever OR clause you are
using, which would become really messy for more complicated expressions,
but perhaps someone knows of some way to do this.

As far as I know, writing out your AND clause before and after
whatever OR clause you are using will be the simplest and most readable
solution. However, you can do what you want using more complicated
expressions, taking advantage of Perl's extended regular expressions.
(You can read "perldoc perlre" to find out more about them.)

Let me warn you, though, they can be rather "messy." For this
reason, instead of searching for commas (or the beginning/end of the
string) like you have, I'll just leave those out, as if all the
elements were one character long. This won't always be the case, of
course, but I figure that you'll be able to add the delimeter detection
in yourself later, but for now, I won't put it in for simplicity's
sake.
1. (a OR b) AND c

For this one, you basically want to search use (a|b), but you also
want to look for c, which may come before or after (a|b). So you can
use this regular expression:

m/(a|b).*c|c.*(a|b)/
2. (a OR b) AND NOT c

This one is trickier, because in order to verify that there is no
'c', you must search the entire line. If 'c' were just one character
long, we could get away with using the [^c] character class, like this:

m/^[^c]*(a|b)[^c]*$/

This searches for 'a' or 'b', but makes sure that ALL the characters
before AND after are NOT 'c'.

However, it's likely that your 'c' term won't be one character long.
In that case, you'll probably want to use a "negative look-ahead"
assertion (again, look it up in "perldoc perlre" if you want to read
details about it -- this is one of extended regular expressions I
mentioned earlier). That way we would have:

m/^((?!c).)*(a|b)((?!c).)*$/

This pattern is essentially the same as the previous one, except
instead of having "[^c]" (which assumes that 'c' is one character
long), we have "((?!c).)". What this pattern matches is any character
provided that 'c' is not immediately found at that spot.

To clarify, if 'c' was actually the string "car", you would write
the term as "((?!car).)". Notice that you still use one '.' even
though "car" is three letters long. That's because the '.' only
matches one character, but with the (?!car) in front of it it'll only
match if that character is not a 'c' that is followed by an 'a' and an
'r'.

(If you put three '.' instead of just one, then the "((?!car)...)*"
expression would match multiples of three characters, which is not what
you want.)

Of course, just as a '*' follows "[^c]", one should also follow
"((?!c).)" because you are necessarily searching through more than one
character (we'll assume that there is more than one character that
comes before and after "(a|b)").
3. (a OR b) AND NOT (c OR d)

This one is pretty much the same as the previous example, except
that instead of using "(?!c)" you'll replace it with "(?!c|d)", like
this:

m/^((?!c|d).)*(a|b)((?!c|d).)*$/

That's pretty much it. Are the expressions messy? Most people
would say yes, so you might want to seriously consider breaking out
each of the above regular expressions into more than one, if only for
readability's sake.

Another tip: Whenever you use a complicated regular expression,
consider putting a comment right above it that clearly states what it's
searching for. For example, you might write your code to look like:

# Look for (a OR b) AND NOT (c OR d):
if ($string =~ m/^((?!c|d).)*(a|b)((?!c|d).)*$/)

This will make your code easier to understand and to debug. Without
the comment, any maintainer that comes after you will have a puzzle to
solve in order to figure out what you really meant. And if for some
reason you (or a future maintainer) introduced a bug in your regular
expression, the comment can serve as a guide to determine whether or
not a bug actually exists in the regular expression (otherwise, it
would be difficult to know for sure).

I hope this helps, Kevin.

-- Jean-Luc
 
X

Xicheng Jia

Kevin said:
Hi,

I'm not sure if this is appropriate for this group, it's related to
perlre rather than perl, if so, apologies and I'd appreciate a pointer
to the right group.

I'm trying to find out if this is possible before I embark upon
implementing a solution to my problem.

I have a string with comma separated tags:
"a, b, c, d, e, f"

It's rather easy to write something to express a boolean OR:
a OR b OR c = (^|,(\s)+)(a|b|c)[,$]

What I would like to know is if there is a way to express AND or NOT:

If you want to test the whole string instead of its sub-strings, then
1. (a OR b) AND c
^(?=.*(a|b))(?=.*c)

2. (a OR b) AND NOT c
^(?=.*(a|b))(?!.*c)

3. (a OR b) AND NOT (c OR d)

^(?=.*(a|b))(?!.*(c|d))

Assumed your a,b,c,d are strings instead of chars, otherwise change
(a|b) to [ab]. you may also want to change (a|b) to (?:a|b) for
efficiency. do the similar thing to (c|d).

Regards,
Xicheng
 
D

Dr.Ruud

Kevin Crosbie schreef:
I have a string with comma separated tags:
"a, b, c, d, e, f"

It's rather easy to write something to express a boolean OR:
a OR b OR c = (^|,(\s)+)(a|b|c)[,$]

What I would like to know is if there is a way to express AND or NOT:
1. (a OR b) AND c
2. (a OR b) AND NOT c
3. (a OR b) AND NOT (c OR d)

I imagine there is no nice way to do this without doing something like
writing out your AND clause before and after whatever OR clause you
are using, which would become really messy for more complicated
expressions, but perhaps someone knows of some way to do this.

From your "before and after" I assume that 1. is true for the string "c,
a".

1. /[ab]/ and /c/

/(?:[ab].*c|c.*[ab])/

2. /[ab]/ and !/c/

/^[^c]*[ab][^c]*$/

3. /[ab]/ and !/[cd]/

/^[^cd]*[ab][^cd]*$/
 
D

Dr.Ruud

Kevin Crosbie schreef:
I have a string with comma separated tags:
"a, b, c, d, e, f"

It's rather easy to write something to express a boolean OR:
a OR b OR c = (^|,(\s)+)(a|b|c)[,$]

What I would like to know is if there is a way to express AND or NOT:
1. (a OR b) AND c
2. (a OR b) AND NOT c
3. (a OR b) AND NOT (c OR d)

I imagine there is no nice way to do this without doing something like
writing out your AND clause before and after whatever OR clause you
are using, which would become really messy for more complicated
expressions, but perhaps someone knows of some way to do this.

From your "before and after" I take it that 1. should be
true for the string "c, a" (so the shown order in your example
is not productive).

If your tags all have length one, as shown:

1. /[ab]/ && /c/

/[ab].*c|c.*[ab]/

2. /[ab]/ && !/c/

/^[^c]*[ab][^c]*$/

3. /[ab]/ && !/[cd]/

/^[^cd]*[ab][^cd]*$/


If your tags can be like "foo, bar, baz, etc.",
then change "[ab]" to "\b(?:foo|bar)\b"
and "c" to "\bbaz\b" or just as well to "\b(?:baz)\b"
and "[cd]" to "\b(?:baz|etc\.)(?=,|$)".

1. /\b(?:foo|bar)\b/ && /\b(?:baz)\b/

/\b(?:foo|bar)\b.*\b(?:baz)\b|\b(?:baz)\b.*\b(?:foo|bar)\b/

2. /\b(?:foo|bar)\b/ && !/\b(?:baz)\b/

3. /\b(?:foo|bar)\b/ && !/\b(?:baz|etc\.)(?=,|$)/

Check out "(?<!pattern)" and "(?!pattern)" in perlre, if you want to
write 2. and 3. as a single regex.
 
K

Kevin Crosbie

A. Sinan Unur said:
It all depends on what you mean perlre. If you referring to regular
expressions as implemented in Perl, then it is the right group. However,
if somehow you are referring to some library that implements a regular
expression facility similar to Perl's, it is probably not.

Sinan

You are correct, I'm not actually using perlre.
The library in question is OROMatcher, which claims to implement Regular
Expressions "exactly" as outlined in perlre/Perl5. It also claims to
support Perl5 extended regular expressions fully.

I have no idea of the credibility of the claim, but it seems likely that
advice for perlre should be the same if not quite close to what I'm
looking for.

I've actually since read that OROMatcher was made opensource for Apache,
but I still think this group is more appropriate.

I've worked with Regular Expressions in the past with various different
languages. I always find that I need to spend lots of time digging
right back into them to solve the problem I'm tackling. Hence I posted
here first just to make sure I wasn't going down the wrong path before
investing time.

'Some people, when confronted with a problem, think “I know, I’ll use
regular expressions.” Now they have two problems.' —Jamie Zawinski

The advice/suggestions/critiques I've gotten so far have been great!
They've gone beyond what I had hoped for. I'm quite impressed with how
helpful the c.l.perl.misc group is.

Many Thanks to all,

Kevin
 
K

Kevin Crosbie

Kevin Crosbie wrote:

Um, I don't think the "[,$]" is doing what you think it should be
doing (in fact, I don't think that'll even compile). What you should
use in its place is "(,|$)".

Can't you define a character class consisting of , and endline using
[,$]? I'm probably confused though. It's been a while since I've
used Regular Expressions.
However, it's likely that your 'c' term won't be one character long.
In that case, you'll probably want to use a "negative look-ahead"
assertion (again, look it up in "perldoc perlre" if you want to read
details about it -- this is one of extended regular expressions I
mentioned earlier). That way we would have:

m/^((?!c).)*(a|b)((?!c).)*$/

I had played around with this but couldn't get it to do what I wanted.
Thanks for the example.
That's pretty much it. Are the expressions messy? Most people
would say yes, so you might want to seriously consider breaking out
each of the above regular expressions into more than one, if only for
readability's sake.

I imagined the solution for this using a single regular expression would
be messy. Timing is an issue and I'm calling out to an external API to
perform the regular expressions. I imagine using negative look-ahead
means you must make at least one full pass on your input, but I think
this is a lot less overhead than making multiple calls to an external
API. My input is never longer than 4000 characters.
Another tip: Whenever you use a complicated regular expression,
consider putting a comment right above it that clearly states what it's
searching for. For example, you might write your code to look like:

# Look for (a OR b) AND NOT (c OR d):
if ($string =~ m/^((?!c|d).)*(a|b)((?!c|d).)*$/)

This will make your code easier to understand and to debug. Without
the comment, any maintainer that comes after you will have a puzzle to
solve in order to figure out what you really meant. And if for some
reason you (or a future maintainer) introduced a bug in your regular
expression, the comment can serve as a guide to determine whether or
not a bug actually exists in the regular expression (otherwise, it
would be difficult to know for sure).

Good advice. I agree, just trying to construct the correct expression
seems quite tricky. I wouldn't like to have anybody try to reverse
engineer this kind of stuff just to find out what it does.
I hope this helps, Kevin.

Many Thanks.
 
K

Kevin Crosbie

Xicheng said:
Assumed your a,b,c,d are strings instead of chars, otherwise change
(a|b) to [ab]. you may also want to change (a|b) to (?:a|b) for
efficiency. do the similar thing to (c|d).

Regards,
Xicheng

Thanks Xicheng
 
K

Kevin Crosbie

Brian said:
But REs are really the wrong tool for the job. If this were a real
Perl question I'd say change your API to take a CODE ref rather than a
regex.

Hi Brian,

Thanks for the reply. If by CODE ref you mean a function pointer, then
you could be right, the problem I have might result in more
readable/maintainable/extendible code were I to pass a function. But
then I'd write it in Lisp ;)

In this case I don't have that option though.

Kevin
 
T

Tad McClellan

Kevin Crosbie said:
Kevin Crosbie wrote:

Um, I don't think the "[,$]" is doing what you think it should be
doing (in fact, I don't think that'll even compile). What you should
use in its place is "(,|$)".

Can't you define a character class consisting of , and endline using ^^^^^^^^^^^^^

s/endline/endstring/


[,$]?


No, you cannot.

A character class matches exactly one character.

An anchor matches a _position_ (ie. zero characters).

In Perl, it is further complicated in that

[,$]

is not a character class because $] is a variable, there is
no closing ] for that "class".
 
U

Uri Guttman

KC> You are correct, I'm not actually using perlre. The library in
KC> question is OROMatcher, which claims to implement Regular
KC> Expressions "exactly" as outlined in perlre/Perl5. It also claims
KC> to support Perl5 extended regular expressions fully.

then they are lying like all perl regex emulators. perl has the ability
to use perl code inside a regex. hard to emulate that without having
perl inside.

and i always love how every other lang slams perl and always claims perl
compatible regexes. wait until they try to get perl6 grammars emulation
working. but since the PGE will be written in parrot which you can
embed, maybe they will actually do it right. but again, PGE allows perl
code inside so that will again stifle full emulation. too bad.

uri
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top