[OT] a little about regex

Discussion in 'Python' started by Fulvio, Oct 18, 2006.

  1. Fulvio

    Fulvio Guest

    Hello,

    I'm trying to get an assertion working that accepts addresses from some
    domain, but not when the domain is prefixed by '.com'.
    Even putting the result in a negated test doesn't give me the wanted result.

    The thought, in program terms:

    >>> def filter(adr):
    ...     import re
    ...     allow = re.compile('.*\.my(>|$)')
    ...     deny = re.compile('.*\.com\.my(>|$)')
    ...     cnt = 0
    ...     if deny.search(adr): cnt += 1
    ...     if allow.search(adr): cnt += 1
    ...     return cnt
    ...
    >>> filter('')
    2
    >>> filter('')
    1
    >>>


    It seems I'm missing a better regex to keep both filters from taking
    action. I'm thinking of the lookbehind (negative or positive) option, but I
    haven't managed to get it working yet.
    I think either the allow pattern should require that there is no '.com'
    before '.my', or the deny pattern should match _only_ '.com' before '.my'.
    Sorry, I can't work out the correct syntax to do it.

    Suggestions are welcome.

    F
     
    Fulvio, Oct 18, 2006
    #1

  2. Fulvio

    Ron Adam Guest

    Fulvio wrote:
    > Hello,
    >
    > I'm trying to get an assertion working that accepts addresses from some
    > domain, but not when the domain is prefixed by '.com'.
    > Even putting the result in a negated test doesn't give me the wanted result.
    >
    > The thought, in program terms:
    >
    > >>> def filter(adr):
    > ...     import re
    > ...     allow = re.compile('.*\.my(>|$)')
    > ...     deny = re.compile('.*\.com\.my(>|$)')
    > ...     cnt = 0
    > ...     if deny.search(adr): cnt += 1
    > ...     if allow.search(adr): cnt += 1
    > ...     return cnt
    > ...
    > >>> filter('')
    > 2
    > >>> filter('')
    > 1
    >
    > It seems I'm missing a better regex to keep both filters from taking
    > action. I'm thinking of the lookbehind (negative or positive) option, but I
    > haven't managed to get it working yet.
    > I think either the allow pattern should require that there is no '.com'
    > before '.my', or the deny pattern should match _only_ '.com' before '.my'.
    > Sorry, I can't work out the correct syntax to do it.
    >
    > Suggestions are welcome.
    >
    > F


    Instead of using two separate ifs, use an if/elif and be sure to test the
    narrower filter first (you have them in the correct order). That way it will
    skip the more general filter and not increment cnt twice.

    It's not exactly clear what output you are seeking. If you want 0 for not
    filtered and 1 for filtered, then look at Fred's hint.

    Or are you writing a test at the moment, where a 1 means it only passed one
    filter, so you know your filters are working as designed?

    Another approach would be to assign values for filtered, accepted, and undefined
    and set those accordingly instead of incrementing and decrementing a counter.
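
    A minimal sketch of both suggestions combined: an if/elif that tests the
    narrower filter first, and named outcomes instead of a counter. The function
    name classify, the three labels and the sample addresses are made up for
    illustration; only the two patterns come from the original post.

```python
import re

# Patterns from the original post, written as raw strings.
DENY = re.compile(r'.*\.com\.my(>|$)')   # narrower filter, tested first
ALLOW = re.compile(r'.*\.my(>|$)')       # more general filter

# Hypothetical outcome labels, replacing the counter.
FILTERED, ACCEPTED, UNDEFINED = 'filtered', 'accepted', 'undefined'


def classify(adr):
    """Classify an address with exactly one outcome."""
    if DENY.search(adr):
        return FILTERED
    elif ALLOW.search(adr):   # only reached when DENY did not match
        return ACCEPTED
    return UNDEFINED


print(classify('someone@example.com.my'))  # filtered
print(classify('someone@example.my'))      # accepted
print(classify('someone@example.org'))     # undefined
```

    Because the elif branch is skipped whenever the deny pattern matches, an
    address can never be counted by both filters.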

    Cheers,
    Ron
     
    Ron Adam, Oct 18, 2006
    #2

  3. Fulvio

    Rob Wolfe Guest

    Re: a little about regex

    Fulvio wrote:

    > I'm trying to get an assertion working that accepts addresses from some
    > domain, but not when the domain is prefixed by '.com'.
    > Even putting the result in a negated test doesn't give me the wanted result.


    [...]

    > It seems I'm missing a better regex to keep both filters from taking
    > action. I'm thinking of the lookbehind (negative or positive) option, but I
    > haven't managed to get it working yet.
    > I think either the allow pattern should require that there is no '.com'
    > before '.my', or the deny pattern should match _only_ '.com' before '.my'.
    > Sorry, I can't work out the correct syntax to do it.
    >
    > Suggestions are welcome.


    Try this:

    def filter(adr): # note that "filter" is a builtin function also
        import re

        allow = re.compile(r'.*(?<!\.com)\.my(>|$)') # negative lookbehind
        deny = re.compile(r'.*\.com\.my(>|$)')
        cnt = 0
        if deny.search(adr): cnt += 1
        if allow.search(adr): cnt += 1
        return cnt
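
    To see what the negative lookbehind changes, here is a small check of both
    patterns against a few sample addresses. The addresses are hypothetical; the
    patterns are the ones given above.

```python
import re

allow = re.compile(r'.*(?<!\.com)\.my(>|$)')  # '.my' not preceded by '.com'
deny = re.compile(r'.*\.com\.my(>|$)')        # '.com.my' only

# Hypothetical sample addresses, not taken from the thread.
samples = (
    'someone@example.my',
    'someone@example.com.my',
    '<someone@example.my>',
)

for adr in samples:
    # Prints the address, then whether allow and deny matched.
    print(adr, bool(allow.search(adr)), bool(deny.search(adr)))
```

    With the lookbehind in place the two patterns are mutually exclusive for
    these inputs, so the counter in the function never reaches 2.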


    HTH,
    Rob
     
    Rob Wolfe, Oct 18, 2006
    #3
  4. Fulvio

    Fulvio Guest

    Re: a little about regex

    On Wednesday 18 October 2006 16:43, Rob Wolfe wrote:

    > |def filter(adr):    # note that "filter" is a builtin function also
    > |    import re


    I didn't know that, but my function _does_ start with an underscore (a bit
    of localization :) )

    > |    allow = re.compile(r'.*(?<!\.com)\.my(>|$)')  # negative lookbehind
    > |    deny = re.compile(r'.*\.com\.my(>|$)')


    Great, it works perfectly. I found my errors.
    I didn't put r in front of the patterns, and I was close to the 'allow'
    pattern, but it didn't give a positive result and KregexEditor reported it
    as wrong, especially because of the '<' inside the pattern. I think that is
    not normal regex input and is only valid in Python. Am I right?

    More details are in the previous thread.
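
    On the r prefix mentioned above, a short sketch of what it changes. An
    escape such as a backslash followed by a dot happens to survive in a normal
    string, because Python leaves unrecognized escapes alone (recent versions
    warn about them), but an escape that Python does recognize, such as \b, is
    converted before the regex engine ever sees it. The header line below is
    hypothetical.

```python
import re

# Without the r prefix, '\b' is the single backspace character.
print(len('\b'), len(r'\b'))   # 1 2

line = 'From: someone@example.it'   # hypothetical header line

# Raw string: the regex engine receives \b and treats it as a word boundary.
print(bool(re.search(r'\.it\b', line)))    # True

# Normal string: the engine receives a literal backspace after '.it'.
print(bool(re.search('\\.it\b', line)))    # False
```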

    F
     
    Fulvio, Oct 18, 2006
    #4
  5. Fulvio

    Fulvio Guest

    On Wednesday 18 October 2006 15:32, Ron Adam wrote:

    > |Instead of using two separate if's, Use an if - elif and be sure to test


    Thank you, Ron, for the input :)
    I'll look into doing it that way too. Meanwhile I faced the total
    disaster :) of deleting all my emails from all servers ;(
    (I had saved them locally, luckily :) )

    > |It's not exactly clear on what output you are seeking.  If you want 0 for
    > | not filtered and 1 for filtered, then look to Freds Hint.


    Actually the return value is used like this:

    if _filter(hdrs,allow,deny):
        # allow and deny are objects prepared by re.compile(pattern)
        _del(Num_of_Email)

    In short, a true value means the message is unwanted and gets deleted.
    And now the function is:

    def _filter(msg,al,dn):
        """ Filter try to classify a list of lines for a set of compiled
        patterns."""
        a = 0
        for hdrline in msg:
            # deny has the first priority and stop any further searching. Score 10
            #times
            if dn.search(hdrline): return len(msg) * 10
            if al.search(hdrline): return 0
            a += 1
        return a # it returns with a score of rejected matches or zero if none


    The patterns are taken from a configuration file. Those named Axx = 'pattern'
    allow messages; the others, named Dxx, block them under different criteria.
    Here they are:

    [Filters]
    A01 = ^From:.*\.it\b
    A02 = ^(To|Cc):.*frioio@
    A03 = ^(To|Cc):.*the_sting@
    A04 = ^(To|Cc):.*calm_me_or_die@
    A05 = ^(To|Cc):.*further@
    A06 = ^From:.*\.za\b
    D01 = ^From:.*\.co\.au\b
    D02 = ^Subject:.*\*\*\*SPAM\*\*\*

    *Slightly faked in order to keep some privacy* :)
    I'm using configparser to fetch their values, and they are joined by:

    allow = re.compile('|'.join([k[1] for k in ifil if k[0] == 'a']))
    deny = re.compile('|'.join([k[1] for k in ifil if k[0] == 'd']))

    ifil is the input filter's section.
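
    A self-contained sketch of that step, assuming Python 3's configparser and
    a shortened copy of the [Filters] section. The CONFIG_TEXT string, the
    parser options and the sample header lines are assumptions made for the
    example, not part of the original program.

```python
import configparser
import re

# A shortened, hypothetical stand-in for the configuration file.
CONFIG_TEXT = r"""
[Filters]
A01 = ^From:.*\.it\b
A02 = ^(To|Cc):.*frioio@
D01 = ^From:.*\.co\.au\b
D02 = ^Subject:.*\*\*\*SPAM\*\*\*
"""

# interpolation=None keeps any '%' literal; '=' is the only delimiter, so the
# ':' inside the patterns is never taken for a key/value separator.
parser = configparser.ConfigParser(interpolation=None, delimiters=('=',))
parser.read_string(CONFIG_TEXT)
ifil = parser.items('Filters')   # list of (name, pattern) pairs

# ConfigParser lower-cases option names by default, hence 'a' and 'd'.
# The comparison uses ==, since 'is' tests identity rather than equality.
allow = re.compile('|'.join(p for name, p in ifil if name[0] == 'a'))
deny = re.compile('|'.join(p for name, p in ifil if name[0] == 'd'))

print(bool(allow.search('From: someone@example.it')))     # True
print(bool(deny.search('Subject: ***SPAM*** buy now')))   # True
print(bool(deny.search('From: someone@example.it')))      # False
```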

    At this point I suppose I have done the right thing; I'm just a bit
    curious to know whether there's a better way, with a single regex
    compilation for all of the options.
    Basically the program will work, in terms of filtering as per the config
    and synchronizing with the local $HOME/Mail/trash (configurable path). This
    last option removes from the server the emails that are in the local
    trash.
    Todo = back up local and remote emails for those filtered as good,
    multithreading to connect to all servers in parallel,
    SSL for POP3 and IMAP4 as well.
    Actually I have a problem issuing the command to the IMAP server to flag as
    "Deleted" the messages which count as spam. I only know the message details,
    but what the correct command is remains a bit obscure to me.
    BTW, who's Fred?

    F
     
    Fulvio, Oct 18, 2006
    #5
  6. Fulvio

    Ant Guest

    Re: a little about regex

    Rob Wolfe wrote:
    ....
    > def filter(adr): # note that "filter" is a builtin function also
    >     import re
    >
    >     allow = re.compile(r'.*(?<!\.com)\.my(>|$)') # negative lookbehind
    >     deny = re.compile(r'.*\.com\.my(>|$)')
    >     cnt = 0
    >     if deny.search(adr): cnt += 1
    >     if allow.search(adr): cnt += 1
    >     return cnt


    Which makes the 'deny' code here redundant so in this case the function
    could be reduced to:

    import re

    def allow(adr): # note that "filter" is a builtin function also
        allow = re.compile(r'.*(?<!\.com)\.my(>|$)') # negative lookbehind
        if allow.search(adr):
            return True
        return False

    Though having the explicit allow and deny expressions may make what's
    going on clearer than the fairly esoteric negative lookbehind.
     
    Ant, Oct 18, 2006
    #6
  7. Fulvio

    Rob Wolfe Guest

    Re: a little about regex

    Fulvio wrote:

    > Great, it works perfectly. I found my errors.
    > I didn't put r in front of the patterns, and I was close to the 'allow'
    > pattern, but it didn't give a positive result and KregexEditor reported it
    > as wrong, especially because of the '<' inside the pattern. I think that is
    > not normal regex input and is only valid in Python. Am I right?


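    The sequence inside "(?...)" is an extension notation. It is not specific
    to Python, though: the syntax comes from Perl and is supported by other
    Perl-style engines such as PCRE, but not by POSIX regular expressions.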

    Regards,
    Rob
     
    Rob Wolfe, Oct 19, 2006
    #7
  8. Fulvio

    Fulvio Guest

    Re: a little about regex

    On Wednesday 18 October 2006 23:05, Ant wrote:
    >     allow = re.compile(r'.*(?<!\.com)\.my(>|$)')  # negative lookbehind
    >     if allow.search(adr):
    >         return True
    >     return False


    I'd point out that:
    allow = re.search(r'.*(?<!\.com)\.my(>|$)',adr)

    will do the same as yours, since the module-level re.search() compiles the
    pattern itself (and caches it), which your code does as a separate step.
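
    A small sketch of that equivalence, using a hypothetical address. Both
    calls go through the same compiled pattern machinery, and the re module
    keeps a cache of recently used patterns, so the difference is mostly one of
    style unless many different patterns are in play.

```python
import re

PATTERN = r'.*(?<!\.com)\.my(>|$)'
adr = 'someone@example.my'   # hypothetical address

compiled = re.compile(PATTERN)   # explicit, reusable pattern object
m1 = compiled.search(adr)

m2 = re.search(PATTERN, adr)     # compiles (or reuses the cached pattern) internally

print(m1.group() == m2.group())  # True
```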

    > Though having the explicit allow and deny expressions may make what's
    > going on clearer than the fairly esoteric negative lookbehind.


    This makes me think that your point is quite correct.
    The option in my case is meant as "deny all except those that are
    specified". It may also go the other way round. Therefore I should refine
    the way the filtering acts.
    In fact the (temporarily) ignored score is the basis of the method to be
    applied.
    Obviously we are mainly talking about email addresses here, so my intention
    is something like the mailfilter concept: the program may block an entire
    domain while allowing some addresses, and everything from ".my" is allowed
    but not what comes from ".com.my" (mostly annoying emails :p )

    All things considered, I've aimed at flexible programming, as I'm thinking
    it may be published some time for the benefit of multiplatform users, since
    Python is multiplatform.
    With that in mind, I'm a bit curious to know whether there are sites on the
    web where small programs are welcome and people like me can express all of
    their ignorance about how to use Python. For such ignorance I might compete
    for the Nobel Prize :)

    Also, the newsgroup doesn't contemplate the idea of splitting into
    beginners and high-level programmers (HLP). Of course the HLP are welcome
    to discuss on such a NG :).

    F
     
    Fulvio, Oct 19, 2006
    #8
  9. Fulvio

    Ron Adam Guest

    Fulvio wrote:
    > On Wednesday 18 October 2006 15:32, Ron Adam wrote:
    >
    >> |Instead of using two separate if's, Use an if - elif and be sure to test

    >
    > Thank you, Ron, for the input :)
    > I'll look into doing it that way too. Meanwhile I faced the total
    > disaster :) of deleting all my emails from all servers ;(
    > (I had saved them locally, luckily :) )
    >
    >> |It's not exactly clear on what output you are seeking. If you want 0 for
    >> | not filtered and 1 for filtered, then look to Freds Hint.

    >
    > Actually the return value is used like this:
    >
    > if _filter(hdrs,allow,deny):
    >     # allow and deny are objects prepared by re.compile(pattern)
    >     _del(Num_of_Email)
    >
    > In short, a true value means the message is unwanted and gets deleted.
    > And now the function is:
    >
    > def _filter(msg,al,dn):
    >     """ Filter try to classify a list of lines for a set of compiled
    >     patterns."""
    >     a = 0
    >     for hdrline in msg:
    >         # deny has the first priority and stop any further searching. Score 10
    >         #times
    >         if dn.search(hdrline): return len(msg) * 10
    >         if al.search(hdrline): return 0
    >         a += 1
    >     return a # it returns with a score of rejected matches or zero if none


    I see, is this a cleanup script to remove the least wanted items?

    The allow/deny caused me to think it was more along the lines of a white/black
    list, whereas keep/discard would be terms more suitable to cleaning out items
    already allowed.

    Or is it a bit of both? Why the score?

    Just curious, I don't think I have any suggestions that will help in any
    specific ways.

    I would think the allow(keep?) filters would always have priority over deny filters.


    > The patterns are taken from a configuration file. Those named Axx = 'pattern'
    > allow messages; the others, named Dxx, block them under different criteria.
    > Here they are:
    >
    > [Filters]
    > A01 = ^From:.*\.it\b
    > A02 = ^(To|Cc):.*frioio@
    > A03 = ^(To|Cc):.*the_sting@
    > A04 = ^(To|Cc):.*calm_me_or_die@
    > A05 = ^(To|Cc):.*further@
    > A06 = ^From:.*\.za\b
    > D01 = ^From:.*\.co\.au\b
    > D02 = ^Subject:.*\*\*\*SPAM\*\*\*
    >
    > *Slightly faked in order to keep some privacy* :)
    > I'm using configparser to fetch their values, and they are joined by:
    >
    > allow = re.compile('|'.join([k[1] for k in ifil if k[0] == 'a']))
    > deny = re.compile('|'.join([k[1] for k in ifil if k[0] == 'd']))
    >
    > ifil is the input filter's section.
    >
    > At this point I suppose I have done the right thing; I'm just a bit
    > curious to know whether there's a better way, with a single regex
    > compilation for all of the options.


    I think keeping the allow filter separate from the deny filter is good.

    You might be able to merge the header lines and run the filters across the whole
    header at once instead of each line.
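
    A rough sketch of that idea, assuming the header fields are joined with
    newlines and the patterns are compiled with re.MULTILINE so that '^' still
    anchors at the start of each field. The header lines are hypothetical, and
    the two patterns are single entries in the style of the [Filters] section
    rather than the full joined expressions.

```python
import re

# Hypothetical header lines for one message.
hdrs = [
    'From: someone@example.co.au',
    'To: frioio@example.org',
    'Subject: hello',
]

# re.MULTILINE lets '^' match at the start of every line, not only at the
# start of the whole string. '.' does not cross newlines by default, so each
# '.*' stays within its own header field.
allow = re.compile(r'^(To|Cc):.*frioio@', re.MULTILINE)
deny = re.compile(r'^From:.*\.co\.au\b', re.MULTILINE)

header = '\n'.join(hdrs)   # one string, one line per header field

print(bool(deny.search(header)))    # True
print(bool(allow.search(header)))   # True
```

    Both filters match this header, so the order in which the two results are
    checked decides which one has priority, just as the order of the two tests
    does in the per-line loop.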

    > Basically the program will work, in terms of filtering as per the config
    > and synchronizing with the local $HOME/Mail/trash (configurable path). This
    > last option removes from the server the emails that are in the local
    > trash.
    > Todo = back up local and remote emails for those filtered as good,
    > multithreading to connect to all servers in parallel,
    > SSL for POP3 and IMAP4 as well.
    > Actually I have a problem issuing the command to the IMAP server to flag as
    > "Deleted" the messages which count as spam. I only know the message details,
    > but what the correct command is remains a bit obscure to me.


    I can't help you here. Sorry.

    > BTW, who's Fred?
    >
    > F


    Fredrik see...

    news://news.cox.net:119/
     
    Ron Adam, Oct 19, 2006
    #9
  10. Fulvio

    Fulvio Guest

    On Friday 20 October 2006 02:40, Ron Adam wrote:
    > I see, is this a cleanup script to remove the least wanted items?


    Yes. It will probably remain in this mode for a while.
    I'm not prepared to bring out a new algorithm.

    > Or is it a bit of both?  Why the score?


    As explained in another post, there should be a way to define a deny/allow
    with some particular exception (i.e. deny all ".com" but not
    )

    > I would think the allow(keep?) filters would always have priority over deny
    > filters.


    It's a matter of how fine the discrimination needs to be. The previous post
    brought this point up. I'd like to allow all ".uk" (let us say) but not
    ".info.uk" (all references are purely meant as examples). A regex denial on
    ".info.uk" therefore surely doesn't match a plain ".uk".
    >


    > I think keeping the allow filter separate from the deny filter is good.

    I agree with you. I was simply supposing that a regex could do negative matching.

    > You might be able to merge the header lines and run the filters across the
    > whole header at once instead of each line.


    I got the idea, which is good; I still need a bit of thinking to code it.
    I need to work out what the right separator between fields will be,
    otherwise it may cause problems with different charsets.

    > > Actually I've problem on issuing the command to imap server to flag
    > > "Deleted" the message which count as spam. I only know the message

    >
    > I can't help you here.  Sorry.


    Found it :), by try&fail.
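
    The thread does not record which command was found. One standard way with
    Python's imaplib, offered here only as a sketch, is a STORE with
    +FLAGS \Deleted followed by an EXPUNGE. The function names flag_deleted and
    purge, and the host, user and password in the usage comment, are
    placeholders made up for the example.

```python
def flag_deleted(conn, num):
    """Mark message number `num` as deleted in the selected mailbox.

    `conn` is assumed to be a logged-in imaplib.IMAP4 (or IMAP4_SSL) object
    on which select() has already been called.
    """
    typ, data = conn.store(num, '+FLAGS', '\\Deleted')
    return typ == 'OK'


def purge(conn):
    """Permanently remove every message flagged as deleted."""
    typ, data = conn.expunge()
    return typ == 'OK'


# Hypothetical usage (host, user and password are placeholders):
#   import imaplib
#   conn = imaplib.IMAP4_SSL('imap.example.org')
#   conn.login('user', 'password')
#   conn.select('INBOX')
#   flag_deleted(conn, '1')
#   purge(conn)
#   conn.logout()
```

    Flagging and expunging are separate steps, so a message marked by mistake
    can still have the flag removed with '-FLAGS' before the expunge runs.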

    > > BTW whose Fred?

    >    
    > news://news.cox.net:119/


    I can't connect to newsgroup servers other than the one my ISP gives me.
    I'm curious and I'll give it a try.

    F
     
    Fulvio, Oct 20, 2006
    #10
