Negative lookahead regex clarification needed

Discussion in 'Perl Misc' started by shifty, Jan 19, 2005.

  1. shifty

    shifty Guest

    Hi,

    I'm trying to hack my way through a regex for a chunk of code I'm going
    to use. I've been using a Regex Coach to run through this and I think
    I have correct syntax.

    I am trying to find any one of several 'hacked' variants of the word
    "microsoft" (ex: m1cr0s0ft, miçr0§0ft, etc.), but NOT match on the
    actual word "microsoft". I need the regex to be case insensitive.

    This is my regex - it seems to work, but I don't know if the syntax is
    honestly correct and I don't want it to break later:

    (?i).*\b(?:(?!microsoft)m+[i1l\\\|!¡îíìï]+[Cç]+r+[o0öøõôóòð]+[s§]+[o0öøõôóòð]+f+[t\+]+)\b.*

    This expression will:
    Be case insensitive
    Have a word boundary to limit only finding the word I'm looking for
    Allow anything to precede this word's boundaries
    Match on several variants of 'microsoft' as long as negative lookahead
    doesn't find the proper spelling
    Will not capture the match if one is found
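
    One way to check the claims in this list is to run the pattern against a handful of samples. The sketch below is only an illustration: the sample strings and the is_hacked helper are assumptions, not part of the original post, and the pattern is copied as posted.

```perl
use strict;
use warnings;
use utf8;    # the pattern and one sample contain non-ASCII literals

# The pattern exactly as posted above.
my $re = qr/(?i).*\b(?:(?!microsoft)m+[i1l\\\|!¡îíìï]+[Cç]+r+[o0öøõôóòð]+[s§]+[o0öøõôóòð]+f+[t\+]+)\b.*/;

# Hypothetical helper: 1 when the text contains a disguised spelling, else 0.
sub is_hacked {
    my ($text) = @_;
    return $text =~ $re ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $sample ('Buy m1cr0s0ft now', 'miçr0§0ft', 'mmiiccrr00ss00fftt', 'Buy Microsoft now') {
    printf "%-22s %s\n", $sample, is_hacked($sample) ? 'match' : 'no match';
}
```

    With these samples, the first three should report a match and the correctly spelled one should not, because the (?!microsoft) lookahead is also subject to (?i).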

    Is this correct? Any help is appreciated. I'm going to need to knock
    out several of these things.

    I'm just starting with regex, and I'm totally in love - but it's really
    easy to be inefficient and it's also really, really easy to miss
    "false positives" caused by overlooking an aspect of your expression.
    Reminds me of 'chess vs. chemistry' or something.
     
    shifty, Jan 19, 2005
    #1

  2. On Wed, 19 Jan 2005, shifty wrote:

    > I'm trying to hack my way through a regex for a chunk of code I'm going
    > to use. I've been using a Regex Coach to run through this and I think
    > I have correct syntax.


    I didn't know what "Regex Coach" is (I do now, courtesy of Google),
    but I find "pcretest" (part of the PCRE package from Philip Hazel) to be
    a valuable aid.

    > I am trying to find any one of several 'hacked' variants of the word
    > "microsoft" (ex: m1cr0s0ft, miçr0§0ft, etc.), but NOT match on the
    > actual word "microsoft". I need the regex to be case insensitive.


    Off the top of my head: Perhaps it would be better to do a character
    translation on the string, and then compare the result with the
    original.
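
    A minimal sketch of that translate-and-compare idea follows. The translation tables, the normalize and is_disguised helpers, and the choice to squeeze repeated letters are all assumptions made for illustration; the tables are far from complete.

```perl
use strict;
use warnings;
use utf8;    # the translation tables contain non-ASCII literals

# Map look-alike characters back to the letters they imitate.
sub normalize {
    my ($word) = @_;
    my $plain = lc $word;
    $plain =~ tr/0öøõôóòð/o/;    # a short replacement list repeats its last character
    $plain =~ tr/1!|¡îíìï/i/;
    $plain =~ tr/ç/c/;
    $plain =~ tr/$§/s/;          # tr/// does not interpolate, so $ is literal here
    $plain =~ tr/+/t/;
    $plain =~ tr/a-z//s;         # squeeze runs such as "oo" down to "o"
    return $plain;
}

# Disguised: the translated form equals the target, but the original did not.
# Because of the squeeze step this only suits targets without doubled letters.
sub is_disguised {
    my ($word, $target) = @_;
    return (lc($word) ne $target && normalize($word) eq $target) ? 1 : 0;
}

for my $word ('m1cr0s0ft', 'Microsoft', 'microsof+') {
    printf "%-10s %s\n", $word, is_disguised($word, 'microsoft') ? 'disguised' : 'ok';
}
```

    The comparison with the original is what keeps the correctly spelled word from being flagged.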

    OTOH, if you're in a context where only a regex is acceptable (you're
    not by any chance writing recipes for spamassassin?) then I might have
    to take that back.
     
    Alan J. Flavell, Jan 19, 2005
    #2

  3. shifty

    shifty Guest


    > I didn't know what "Regex Coach" is (I do now, courtesy of Google),
    > but I find "pcretest" (part of the PCRE package from Philip Hazel) to
    > be a valuable aid.


    I'll hafta check that out.


    > OTOH, if you're in a context where only a regex is acceptable (you're
    > not by any chance writing recipes for spamassassin?) then I might
    > have to take that back.


    I am writing recipes for spam rejection, you're sharp ;)

    I'm writing something specific to PCRE. I couldn't find any current
    regex-specific groups.
     
    shifty, Jan 21, 2005
    #3
  4. shifty

    shifty Guest


    > If the syntax weren't correct it wouldn't compile. What you are
    > asking is whether it does what you want it to do, which is about
    > semantics.


    For the purpose it's being used, it is not necessary to compile the
    regex. It's being accessed from an outside resource (spam filter).


    > Is there any reason why you want to use lookahead to exclude
    > unaltered strings like "microsoft"? Just skip those strings using an
    > extra regex, and concentrate on matching the altered variants.


    Yes. I don't want to bounce legitimate emails. Spam emails offering
    their software almost always misspell it at some point; I want to
    bounce anything I can be 99% certain is spam.
     
    shifty, Jan 21, 2005
    #4
  5. shifty

    shifty Guest

    Jim Gibson wrote:
    > In article <>,
    > shifty <> wrote:
    >
    > Yes, it does work, but it could be simplified:


    I'm still not sure how, though :) Seriously, though, I've noticed it
    works for everything but microsof+ (non-word character @ end of
    expression! You actually noted this :) )
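
    The reason is that \b only matches between a word character and a non-word character, and "+" is a non-word character. The sketch below isolates just the tail of the pattern to show this; the (?!\w) alternative is one possible workaround offered as an assumption, not something proposed in the thread, and the simplified patterns deliberately leave out the lookahead and the look-alike classes.

```perl
use strict;
use warnings;

# Tail of the pattern only: a closing \b versus a "no word character next" check.
my $with_b         = qr/\bmicrosof[t+]\b/i;
my $with_lookahead = qr/\bmicrosof[t+](?!\w)/i;

for my $sample ('microsof+', 'microsof+ rocks', 'microsoftware') {
    printf "%-16s \\b: %-3s (?!\\w): %s\n", $sample,
        ($sample =~ $with_b         ? 'yes' : 'no'),
        ($sample =~ $with_lookahead ? 'yes' : 'no');
}
```

    After "+", neither the end of the string nor a space gives \b a word character on either side, so the \b version fails on the first two samples while the (?!\w) version accepts them.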

    > 1. It is useless to have .* at the beginning and end of the regex.


    For the purpose it's being used (spam filter rule), it is necessary.

    > 2. It is useless to group with (?: ... ) in this case


    You're right ... I was doing this because I didn't want to capture the
    match.

    > 3. You don't need all of the plus signs unless you expect repeated
    > characters.


    I do. Spam emails with "hacked" words often use repeat characters to
    fool keyword filtering.

    > 9. Don't forget $ as a replacement for s; $ needs escaping in
    > double-quote context of a regular expression.


    Thanks, missed that one. I hadn't even thought about it. I was
    running through an ASCII character map to look at similar
    characters...dunno how I missed the $ sign.

    >
    > With all of the above points in mind, I would suggest the following:
    >
    > my $regex = qr(
    > (?:\b|\s)
    > (?!microsoft)
    > m
    > [i1l\\\|!¡îíìï]
    > [Cç]
    > r
    > [o0öøõôóòð]
    > [s§\$]
    > [o0öøõôóòð]
    > f
    > [t+]
    > (?:\b|\s)
    > )ix;
    >


    Thanks! I'm going to play with your suggestion for a bit, I think this
    should work. I need to make some versions for pharmaceutical spam as
    well. Should work perfect!


    > Are you looking for other approximations such as 'microsloth' and
    > 'microsquash'?


    Nah, because spammers don't usually do things like that.

    Thanks again for your insight. Couldn't have asked for a more perfect
    answer!
     
    shifty, Jan 21, 2005
    #5
  6. On Fri, 21 Jan 2005, shifty wrote:

    > Jim Gibson wrote:
    >
    > > 2. It is useless to group with (?: ... ) in this case
    >
    > You're right ... I was doing this because I didn't want to capture the
    > match.


    I think Jim means that the negative-lookahead syntax is itself
    non-capturing, despite the parentheses - so you didn't need to nullify
    the capturing anyway.
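
    This can be checked by counting capture groups after a match. In the sketch below the group_count helper and the shortened patterns are assumptions for illustration; $#+ is Perl's count of groups in the last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: number of capture groups in the last successful
# match, or undef when the pattern does not match.
sub group_count {
    my ($re, $text) = @_;
    return undef unless $text =~ $re;
    return $#+;    # last index of @+, i.e. the number of capture groups
}

print group_count(qr/(?!microsoft)m\w+/,     'm1cr0s0ft'), "\n";    # lookahead only
print group_count(qr/(?!microsoft)(?:m\w+)/, 'm1cr0s0ft'), "\n";    # plus a non-capturing group
print group_count(qr/(?!microsoft)(m\w+)/,   'm1cr0s0ft'), "\n";    # plus a capturing group
```

    Only the third pattern should report a group, which suggests that neither the lookahead nor the (?: ... ) wrapper adds one.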

    If you already realised that - apologies in advance.

    No, I don't know where to raise questions specifically about regexes,
    either. But the Perl regulars seem quite a bit more tolerant of
    off-topically regex-related questions here, than they are about
    off-topically CGI questions here :-}
     
    Alan J. Flavell, Jan 21, 2005
    #6
  7. Anno Siegel

    Anno Siegel Guest

    shifty <> wrote in comp.lang.perl.misc:
    >
    > > If the syntax weren't correct it wouldn't compile. What you are
    > > asking is whether it does what you want it to do, which is about
    > > semantics.
    >
    > For the purpose it's being used, it is not necessary to compile the
    > regex. It's being accessed from an outside resource (spam filter).


    Something is going to compile it. Every regex engine in existence
    does that.

    My point was the misuse of "syntax" for "correct code". It's becoming a
    sore spot.

    > > Is there any reason why you want to use lookahead to exclude
    > > unaltered strings like "microsoft"? Just skip those strings using
    > > an extra regex, and concentrate on matching the altered variants.
    >
    > Yes. I don't want to bounce legitimate emails. Spam emails offering
    > their software almost always misspell it at some point; I want to
    > bounce anything I can be 99% certain is spam.


    That's inconclusive, but since you didn't say what your spam filter
    actually does with the regex, there's no way of telling.

    Anno
     
    Anno Siegel, Jan 21, 2005
    #7
  8. shifty

    shifty Guest


    > No, I don't know where to raise questions specifically about regexes,
    > either. But the Perl regulars seem quite a bit more tolerant of
    > off-topically regex-related questions here, than they are about
    > off-topically CGI questions here :-}


    For that, I'm really thankful. Nothing like getting your ass lit up by
    someone when you truly mean well, look twice to make sure you're trying
    to do the right thing, then you get flamed to holy hell for trying to
    be as cautious and netiquette-oriented as possible. :D
     
    shifty, Jan 25, 2005
    #8
  9. shifty

    shifty Guest


    > Something is going to compile it. Every regex engine in existence
    > does that.


    I would guess they're never compiled - regexes are interpreted, eh?
    So, in essence, if I am writing a regex for perl in particular (we'll
    keep it on-topic), perl is an interpreted language and so is a regex,
    so it's processed on the fly instead of compiling it into an object for
    future use. Unless I'm misinterpreting your use of "compile". If so,
    I have a true interest in understanding if you don't mind explaining.


    > My point was the misuse of "syntax" for "correct code". It's
    > becoming a sore spot.


    My apologies. I think we have conflicting views on what a regex really
    is. To me, a regex is a sentence or formula which expresses any number
    of meanings. Without the correct character pattern (and/or placement)
    within the text (and/or string), you don't have a correct statement.

    If you don't produce a correct statement because one or more characters
    are misplaced, is it a syntax error or a code error?
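
    The distinction can be made concrete in Perl, where qr// turns a pattern string into a reusable regex object and dies if the pattern is malformed. The sketch below is an illustration under assumptions: try_compile and the three shortened patterns are hypothetical and are not taken from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: try to compile a pattern string, returning the
# regex object, or undef if the engine rejects the pattern.
sub try_compile {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/i };
    return $re;
}

my $broken = try_compile('m[i1cr0s0ft');                        # unclosed character class
my $loose  = try_compile('m[i1]cr[o0]s[o0]ft');                 # valid, but no lookahead
my $strict = try_compile('(?!microsoft)m[i1]cr[o0]s[o0]ft');    # valid, with lookahead

print defined $broken ? "compiled\n" : "rejected at compile time\n";
print 'Microsoft' =~ $loose  ? "loose pattern matches the real name\n"  : "loose pattern skips the real name\n";
print 'Microsoft' =~ $strict ? "strict pattern matches the real name\n" : "strict pattern skips the real name\n";
```

    The first pattern is a syntax error: it never becomes a usable regex. The second compiles without complaint and still does the wrong thing for this purpose, which is the kind of mistake no compiler will catch.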

    > That's inconclusive, but since you didn't say what your spam filter
    > actually does with the regex, there's no way of telling.


    I use these regex expressions for both SpamAssassin and Vamsoft's Open
    Relay Filter EE. Depends on which mailserver I'm dealing with
    (personal, co-hosted or business). I primarily do more administration
    and hosting type stuff than I do programming - if that's not blatantly
    obvious already.
    Thanks for your input, looking forward to clarification.

    >
    > Anno
     
    shifty, Jan 25, 2005
    #9
  10. Xho

    Guest

    "Alan J. Flavell" <> wrote:
    > On Fri, 21 Jan 2005, shifty wrote:
    >
    > > Jim Gibson wrote:
    > >
    > > > 2. It is useless to group with (?: ... ) in this case
    > >
    > > You're right ... I was doing this because I didn't want to capture the
    > > match.
    >
    > I think Jim means that the negative-lookahead syntax is itself
    > non-capturing, despite the parentheses - so you didn't need to nullify
    > the capturing anyway.
    >
    > If you already realised that - apologies in advance.
    >
    > No, I don't know where to raise questions specifically about regexes,
    > either. But the Perl regulars seem quite a bit more tolerant of
    > off-topically regex-related questions here, than they are about
    > off-topically CGI questions here :-}


    That's probably because CGI is a complete specification of its own,
    independent of Perl; while Perl regexes are not independent of Perl.
    People who ask here about the quirks of Java or .net regexes do
    get a chilly reception.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Jan 26, 2005
    #10
