Negative lookahead regex clarification needed

shifty

Hi,

I'm trying to hack my way through a regex for a chunk of code I'm going
to use. I've been using a Regex Coach to run through this and I think
I have correct syntax.

I am trying to find any one of several 'hacked' variants of the word
"microsoft" (ex: m1cr0s0ft, miçr0§0ft, etc.), but NOT match on the
actual word "microsoft". I need the regex to be case insensitive.

This is my regex - it seems to work, but I don't know if the syntax is
honestly correct and I don't want it to break later:

(?i).*\b(?:(?!microsoft)m+[i1l\\\|!¡îíìï]+[Cç]+r+[o0öøõôóòð]+[s§]+[o0öøõôóòð]+f+[t\+]+)\b.*

This expression will:
Be case insensitive
Have a word boundary to limit only finding the word I'm looking for
Allow anything to precede or follow this word's boundaries
Match on several variants of 'microsoft' as long as the negative
lookahead doesn't find the proper spelling
Not capture the match if one is found

Is this correct? Any help is appreciated. I'm going to need to knock
out several of these things.

I'm just starting with regex, and I'm totally in love - but it's really
easy to be inefficient, and it's also really, really easy to miss
"false positives" caused by overlooking an aspect of your expression.
Reminds me of 'chess vs. chemistry' or something.
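
One way to check the behaviour described above without a spam filter in the loop is to run the posted pattern over a few strings. This is a minimal sketch, assuming Python's re module as a stand-in for PCRE (for these constructs the two should agree, but that is an assumption); the name is_hit and the sample strings are made up for illustration.

```python
import re

# The pattern exactly as posted above. Assumption: Python's re module
# stands in for PCRE here.
POSTED = re.compile(
    r"(?i).*\b(?:(?!microsoft)m+[i1l\\\|!¡îíìï]+[Cç]+r+[o0öøõôóòð]+"
    r"[s§]+[o0öøõôóòð]+f+[t\+]+)\b.*"
)


def is_hit(text: str) -> bool:
    """Return True if the posted pattern matches anywhere in text."""
    return POSTED.search(text) is not None


# Hypothetical sample strings, not taken from real mail.
SAMPLES = [
    "buy m1cr0s0ft now",
    "miçr0§0ft",
    "Microsoft Office",
    "mmiiccr00ssooft",
    "cheap m1cr0s0f+ here",
]

for sample in SAMPLES:
    print(f"{sample!r}: {is_hit(sample)}")
```

The last sample is the one to watch: a '+' followed by a space has no word boundary between them, so the trailing \b fails and that variant is not reported.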
 
Alan J. Flavell

shifty said:
I'm trying to hack my way through a regex for a chunk of code I'm going
to use. I've been using a Regex Coach to run through this and I think
I have correct syntax.

I didn't know what "Regex Coach" is (I do now, courtesy of Google),
but I find "pcretest" (part of the PCRE package from Phil Hazel) to be
a valuable aid.

shifty said:
I am trying to find any one of several 'hacked' variants of the word
"microsoft" (ex: m1cr0s0ft, miçr0§0ft, etc.), but NOT match on the
actual word "microsoft". I need the regex to be case insensitive.

Off the top of my head: Perhaps it would be better to do a character
translation on the string, and then compare the result with the
original.

OTOH, if you're in a context where only a regex is acceptable (you're
not by any chance writing recipes for spamassassin?) then I might have
to take that back.
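
The translation idea above might look something like the following. It is a minimal sketch in Python rather than Perl or PCRE, and the look-alike table (lifted from the character classes in the posted regex), the function names, and the rule of collapsing repeated characters are all assumptions made for illustration.

```python
import re

# Hypothetical look-alike table, taken from the character classes in the
# posted regex; a real filter would need a fuller mapping.
GROUPS = {
    "i": "1l\\|!¡îíìï",
    "c": "ç",
    "o": "0öøõôóòð",
    "s": "§$",
    "t": "+",
}
TABLE = str.maketrans(
    {ch: plain for plain, chars in GROUPS.items() for ch in chars}
)


def normalize(word: str) -> str:
    """Lower-case, map look-alikes to plain letters, collapse repeats."""
    folded = word.lower().translate(TABLE)
    return re.sub(r"(.)\1+", r"\1", folded)


def is_disguised(word: str, target: str = "microsoft") -> bool:
    """True if word normalizes to target but is not the target itself."""
    return word.lower() != target and normalize(word) == target


for word in ["m1cr0s0ft", "miçr0§0ft", "Microsoft", "mm1cr0$0f+", "microscope"]:
    print(f"{word!r}: {is_disguised(word)}")
```

Collapsing repeats is safe for "microsoft" because the word has no doubled letters; a target with a legitimate double letter would need a different rule.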
 
shifty

Alan J. Flavell said:
I didn't know what "Regex Coach" is (I do now, courtesy of Google),
but I find "pcretest" (part of the PCRE package from Phil Hazel) to be
a valuable aid.

I'll hafta check that out.

Alan J. Flavell said:
OTOH, if you're in a context where only a regex is acceptable (you're
not by any chance writing recipes for spamassassin?) then I might have
to take that back.

I am writing recipes for spam rejection, you're sharp ;)

I'm writing something specific to PCRE. I couldn't find any current
regex-specific groups.
 
shifty

Anno Siegel said:
If the syntax weren't correct it wouldn't compile. What you are asking is
whether it does what you want it to do, which is about semantics.

For the purpose it's being used, it is not necessary to compile the
regex. It's being accessed from an outside resource (spam filter).

Anno Siegel said:
Is there any reason why you want to use lookahead to exclude unaltered
strings like "microsoft"? Just skip those strings using an extra regex,
and concentrate on matching the altered variants.

Yes. I don't want to bounce legitimate emails. Spam emails offering
their software almost always misspell it at some point; I want to
bounce anything I can be 99% certain is spam.
 
shifty

Jim said:
Yes, it does work, but it could be simplified:

I'm still not sure how, though :) Seriously, though, I've noticed it
works for everything but microsof+ (a non-word character at the end of
the expression! You actually noted this :) )

Jim said:
1. It is useless to have .* at the beginning and end of the regex.

For the purpose it's being used (spam filter rule), it is necessary.

Jim said:
2. It is useless to group with (?: ... ) in this case

You're right ... I was doing this because I didn't want to capture the
match.

Jim said:
3. You don't need all of the plus signs unless you expect repeated
characters.

I do. Spam emails with "hacked" words often use repeat characters to
fool keyword filtering.

Jim said:
9. Don't forget $ as a replacement for s; $ needs escaping in the
double-quote context of a regular expression.

Thanks, missed that one. I hadn't even thought about it. I was
running through an ASCII character map to look at similar
characters...dunno how I missed the $ sign.

Jim said:
With all of the above points in mind, I would suggest the following:

my $regex = qr(
(?:\b|\s)
(?!microsoft)
m
[i1l\\\|!¡îíìï]
[Cç]
r
[o0öøõôóòð]
[s§\$]
[o0öøõôóòð]
f
[t+]
(?:\b|\s)
)ix;

Thanks! I'm going to play with your suggestion for a bit, I think this
should work. I need to make some versions for pharmaceutical spam as
well. Should work perfect!
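
For trying the suggestion outside of Perl, here is a rough transcription into Python's re module. It is a sketch only: treating re.IGNORECASE | re.VERBOSE as equivalent to Perl's /ix is an assumption, and the name SUGGESTED and the test strings are made up.

```python
import re

# Jim's suggestion, transcribed from Perl's qr(...)ix.
# Assumption: re.IGNORECASE | re.VERBOSE behaves like /ix for this pattern.
SUGGESTED = re.compile(
    r"""
    (?:\b|\s)            # word boundary or whitespace before the word
    (?!microsoft)        # reject the correct spelling
    m
    [i1l\\\|!¡îíìï]
    [Cç]
    r
    [o0öøõôóòð]
    [s§\$]
    [o0öøõôóòð]
    f
    [t+]
    (?:\b|\s)            # word boundary or whitespace after the word
    """,
    re.IGNORECASE | re.VERBOSE,
)

# Hypothetical test strings.
for text in ["m1cr0s0ft", "M1CR0$0FT", "buy m1cr0s0f+ now",
             "m1cr0s0f+", "microsoft", "mm1cr0s0ft"]:
    print(f"{text!r}: {SUGGESTED.search(text) is not None}")
```

Two things show up in a quick run. A trailing '+' is caught when whitespace follows it but not at the very end of the string, where neither \b nor \s can match. And with the plus signs dropped, repeated letters such as "mm1cr0s0ft" slip through, so the quantifiers would need to go back in if repeats are expected.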

Jim said:
Are you looking for other approximations such as 'microsloth' and
'microsquash'?

Nah, because spammers don't usually do things like that.

Thanks again for your insight. Couldn't have asked for a more perfect
answer!
 
Alan J. Flavell

shifty said, in reply to Jim Gibson:

You're right ... I was doing this because I didn't want to capture the
match.

I think Jim means that the negative-lookahead syntax is itself
non-capturing, despite the parentheses - so you didn't need to nullify
the capturing anyway.

If you already realised that - apologies in advance.
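
A quick way to see this is to count capture groups. This is a small sketch using Python's re module as a stand-in for PCRE (an assumption); the three patterns are simplified illustrations, not the regex from the thread.

```python
import re

# Simplified, hypothetical patterns: same lookahead, different wrapping.
plain = re.compile(r"(?!microsoft)m\w+")        # lookahead only
wrapped = re.compile(r"(?:(?!microsoft)m\w+)")  # plus a non-capturing group
capturing = re.compile(r"((?!microsoft)m\w+)")  # plus a capturing group

# .groups is the number of capturing groups in the compiled pattern.
print(plain.groups, wrapped.groups, capturing.groups)

# The first two match the same text; the wrapper changes nothing.
print(plain.search("m1cr0s0ft").group(0))
print(wrapped.search("m1cr0s0ft").group(0))
```

Only the third pattern reports a capture group, so the lookahead's parentheses cost nothing and the (?: ... ) wrapper adds nothing in this case.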

No, I don't know where to raise questions specifically about regexes,
either. But the Perl regulars seem quite a bit more tolerant of
off-topically regex-related questions here, than they are about
off-topically CGI questions here :-}
 
Anno Siegel

shifty said:
For the purpose it's being used, it is not necessary to compile the
regex. It's being accessed from an outside resource (spam filter).

Something is going to compile it. Every regex engine in existence
does that.

My point was the misuse of "syntax" for "correct code". It's becoming a
sore spot.
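
The distinction can be shown in a few lines. This is a sketch in Python's re module, assumed here as a stand-in for any regex engine; the helper name compiles and both patterns are made up for illustration.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if the engine accepts the pattern's syntax."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True


broken = r"(?!microsoft"        # unbalanced parenthesis: a syntax error
loose = r"m[i1]cr[o0]s[o0]ft"   # valid syntax, wrong behaviour

print(compiles(broken))
print(compiles(loose))
# The second pattern compiles, yet it also matches the real word,
# which is a mistake in meaning rather than in syntax.
print(re.search(loose, "microsoft") is not None)
```

The first pattern is rejected before any text is examined; the second is accepted and then quietly does the wrong thing, which is the kind of error no compiler will report.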

shifty said:
Yes. I don't want to bounce legitimate emails. Spam emails offering
their software almost always misspell it at some point; I want to
bounce anything I can be 99% certain is spam.

That's inconclusive, but since you didn't say what your spam filter
actually does with the regex, there's no way of telling.

Anno
 
shifty

Alan J. Flavell said:
No, I don't know where to raise questions specifically about regexes,
either. But the Perl regulars seem quite a bit more tolerant of
off-topically regex-related questions here, than they are about
off-topically CGI questions here :-}

For that, I'm really thankful. Nothing like getting your ass lit up by
someone when you truly mean well, look twice to make sure you're trying
to do the right thing, then you get flamed to holy hell for trying to
be as cautious and netiquette-oriented as possible. :D
 
shifty

Anno Siegel said:
Something is going to compile it. Every regex engine in existence
does that.

I would guess they're never compiled - regexes are interpreted, eh?
So, in essence, if I am writing a regex for perl in particular (we'll
keep it on-topic), perl is an interpreted language and so is a regex,
so it's processed on the fly instead of compiling it into an object for
future use. Unless I'm misinterpreting your use of "compile". If so,
I have a true interest in understanding if you don't mind explaining.

Anno Siegel said:
My point was the misuse of "syntax" for "correct code". It's becoming a
sore spot.

My apologies. I think we have conflicting views on what a regex really
is. To me, a regex is a sentence or formula which expresses any number
of meanings. Without the correct character pattern (and/or placement)
within the text (and/or string), you don't have a correct statement.

If you don't produce a correct statement because one or more characters
are misplaced, is it a syntax error or a code error?

Anno Siegel said:
That's inconclusive, but since you didn't say what your spam filter
actually does with the regex, there's no way of telling.

I use these regex expressions for both SpamAssassin and Vamsoft's Open
Relay Filter EE. Depends on which mailserver I'm dealing with
(personal, co-hosted or business). I primarily do more administration
and hosting type stuff than I do programming - if that's not blatantly
obvious already.

Thanks for your input, looking forward to clarification.
 
xhoster

Alan J. Flavell said:
I think Jim means that the negative-lookahead syntax is itself
non-capturing, despite the parentheses - so you didn't need to nullify
the capturing anyway.

If you already realised that - apologies in advance.

No, I don't know where to raise questions specifically about regexes,
either. But the Perl regulars seem quite a bit more tolerant of
off-topically regex-related questions here, than they are about
off-topically CGI questions here :-}

That's probably because CGI is a complete specification of its own,
independent of Perl; while Perl regexes are not independent of Perl.
People who ask here about the quirks of Java or .net regexes do
get a chilly reception.

Xho
 
