a more precise re for email addys

Discussion in 'Python' started by rbt, Jan 18, 2006.

  1. rbt

    rbt Guest

    Is it possible to write an re that _only_ matches email addresses? I've
    been googling around and have found several examples on the Web, but all
    of them produce too many false positives... here are examples from
    Google that I've experimented with:

    re.compile('([\w\.\-]+@[\w\.\-]+)')
    re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
    re.compile('(\S+)@(\S+)')

    All of these will find email addys, but they also find other things.
    Could someone demonstrate how to write a more accurate re for emails?

    BTW, this is not for spam, but like any tool could be used in a bad way.

    Thanks!
    rbt, Jan 18, 2006
    #1

  2. rbt

    Jim Guest

    Jim, Jan 18, 2006
    #2

  3. Todd Whiteman, Jan 18, 2006
    #3
  4. rbt

    Guest

    rbt> re.compile('([\w\.\-]+@[\w\.\-]+)')
    rbt> re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
    rbt> re.compile('(\S+)@(\S+)')

    rbt> All of these will find email addys, but they also find other
    rbt> things.

    I think the only way to decide if your regular expression does what you want
    is to provide a set of strings it must accept and another set which it must
    reject. Supply those two sets and I'm sure any number of people here can
    come up with a regular expression that distinguishes the two sets.
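
A minimal sketch of that accept/reject comparison, assuming Python 3. The two sample sets and the names `MUST_ACCEPT`, `MUST_REJECT`, and `check_pattern` are made up for illustration; the pattern is the first one from the original post, written as a raw string and applied with `fullmatch` so the whole string has to match.

```python
import re

# Hypothetical sample sets; a real test would use strings from your own data.
MUST_ACCEPT = ["rbt@example.com", "first.last@mail.example.org"]
MUST_REJECT = ["no-at-sign.example.com", "@example.com", "user@", "..@.."]

# First pattern from the original post, as a raw string.
CANDIDATE = re.compile(r"[\w\.\-]+@[\w\.\-]+")


def check_pattern(pattern, must_accept, must_reject):
    """Return (wrongly_rejected, wrongly_accepted) for a compiled pattern."""
    wrongly_rejected = [s for s in must_accept if not pattern.fullmatch(s)]
    wrongly_accepted = [s for s in must_reject if pattern.fullmatch(s)]
    return wrongly_rejected, wrongly_accepted


missed, false_positives = check_pattern(CANDIDATE, MUST_ACCEPT, MUST_REJECT)
print("wrongly rejected:", missed)
print("wrongly accepted:", false_positives)
```

With these sets the pattern rejects nothing it should accept, but it lets `..@..` through, which is the kind of false positive rbt describes.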

    Skip
    , Jan 18, 2006
    #4
  5. rbt

    Guest

    , Jan 18, 2006
    #5
  6. * rbt wrote:

    > Is it possible to write an re that _only_ matches email addresses?


    No. The only way to check if the matched thing is a mail address is to send
    a mail and ask the supposed receiver whether he got it.

    The grammar in RFC 2822 nearly matches anything with an @ in it. So, how
    accurate your regex needs to be depends heavily on the context of the
    usage. For example, my suggestion for web form checkers is always to just
    look for an @ char and do the rest using the human component.
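
As a rough sketch of such a loose form check (the function name `has_at_sign` is hypothetical, and requiring something on each side of the `@` is an added assumption beyond "just look for an @"):

```python
def has_at_sign(text):
    """Loose form check: something non-empty on each side of the last '@'."""
    # rpartition returns ('', '', text) when there is no '@' at all.
    local, sep, domain = text.strip().rpartition("@")
    return bool(sep and local and domain)


print(has_at_sign("rbt@example.com"))
print(has_at_sign("not an address"))
```

Anything that passes would still need the human step (a confirmation mail) before it can be trusted.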

    nd
    --
    Already I've seen people (really!) write web URLs in the form:
    http:\\some.site.somewhere
    [...] How soon until greengrocers start writing "apples $1\pound"
    or something? -- Joona I Palaste in clc
    André Malo, Jan 18, 2006
    #6
  7. rbt

    rbt Guest

    Jim wrote:
    > There is a precise one in a Perl module, I believe.
    > http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html
    > Can you swipe that?
    >
    > Jim
    >


    I can swipe it... but it causes my head to explode. I get unbalanced
    parentheses errors when trying to make it work as a Python re... it makes
    more sense when broken up like this:

    (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]
    +(?:(?:(?:\r\n)... \000-\031]
    +(?:(?:(?:\r\n)... \000-\031]
    +(?:(?:(?:\r\n)... \000-\031]
    ....
    ....
    rbt, Jan 18, 2006
    #7
  8. rbt

    Guest

    Does it really need to be a regular expression? Why not just write a
    short function that breaks apart the input and validates each part?

    def IsEmail(addr):
        'Returns True if addr appears to be a valid email address'

        # require exactly one '@' in the address
        if addr.count('@') != 1:
            return False
        name, host = addr.split('@')

        # verify the hostname (is an IP or has a valid TLD, etc.)
        hostParts = host.split('.')
        ...

    That way you'd have a nice, readable chunk of code that you could tweak
    as needed (for example, maybe you'll find that the RFC is too liberal
    so you'll end up needing to add additional rules to exclude "bad"
    addresses).
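
One way the elided checks might be filled in, as a sketch only. The name `is_email_sketch` and every rule below (the allowed characters, at least two host labels, an alphabetic last label) are assumptions chosen for illustration. They are stricter than the RFC, and IP-address hosts are not handled at all.

```python
import re

# Assumed rules, for illustration only.
_ATOM = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~\-]+"
_LOCAL = re.compile(_ATOM + r"(?:\." + _ATOM + r")*")
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?")


def is_email_sketch(addr):
    """Return True if addr passes a few simple structural checks."""
    # exactly one '@' separating the name from the host
    if addr.count("@") != 1:
        return False
    name, host = addr.split("@")

    # name: dot-separated runs of allowed characters, no empty runs
    if not _LOCAL.fullmatch(name):
        return False

    # host: at least two labels, none starting or ending with a hyphen
    labels = host.split(".")
    if len(labels) < 2:
        return False
    if not all(_LABEL.fullmatch(label) for label in labels):
        return False

    # last label: letters only, two or more (an assumed TLD rule)
    return labels[-1].isalpha() and len(labels[-1]) >= 2


for candidate in ["rbt@example.com", "foo@", "..@..", "user@example.123"]:
    print(candidate, is_email_sketch(candidate))
```

Each rule is a separate `if`, so loosening or tightening one of them does not mean rewriting a single large pattern.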
    , Jan 18, 2006
    #8
  9. rbt

    rbt Guest

    wrote:
    > Does it really need to be a regular expression? Why not just write a
    > short function that breaks apart the input and validates each part?
    >
    > def IsEmail(addr):
    >     'Returns True if addr appears to be a valid email address'
    >
    >     # require exactly one '@' in the address
    >     if addr.count('@') != 1:
    >         return False
    >     name, host = addr.split('@')
    >
    >     # verify the hostname (is an IP or has a valid TLD, etc.)
    >     hostParts = host.split('.')
    >     ...
    >
    > That way you'd have a nice, readable chunk of code that you could tweak
    > as needed (for example, maybe you'll find that the RFC is too liberal
    > so you'll end up needing to add additional rules to exclude "bad"
    > addresses).
    >


    Good idea. I'll see what I can do with this. Thanks!
    rbt, Jan 18, 2006
    #9
  10. wrote:
    > rbt> re.compile('([\w\.\-]+@[\w\.\-]+)')
    > rbt> re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
    > rbt> re.compile('(\S+)@(\S+)')
    >
    > rbt> All of these will find email addys, but they also find other
    > rbt> things.
    >
    > I think the only way to decide if your regular expression does what you want
    > is to provide a set of strings it must accept and another set which it must
    > reject. Supply those two sets and I'm sure any number of people here can
    > come up with a regular expression that distinguishes the two sets.


    Doesn't the relevant RFC state that the only way to
    determine a valid email address is to send to it and
    see if the mail server likes it?

    I believe it explicitly warns against validating email
    addresses, since you will invariably end up refusing to
    accept some valid email addresses.


    --
    Steven.
    Steven D'Aprano, Jan 19, 2006
    #10
  11. rbt

    rbt Guest

    wrote:
    > Does it really need to be a regular expression? Why not just write a
    > short function that breaks apart the input and validates each part?
    >
    > def IsEmail(addr):
    >     'Returns True if addr appears to be a valid email address'
    >
    >     # require exactly one '@' in the address
    >     if addr.count('@') != 1:
    >         return False
    >     name, host = addr.split('@')
    >
    >     # verify the hostname (is an IP or has a valid TLD, etc.)
    >     hostParts = host.split('.')
    >     ...
    >
    > That way you'd have a nice, readable chunk of code that you could tweak
    > as needed (for example, maybe you'll find that the RFC is too liberal
    > so you'll end up needing to add additional rules to exclude "bad"
    > addresses).
    >


    Just to follow up on this: I found that doing something like this,
    together with a more generic RE, gives much better results. Thanks
    for the idea!
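
The post does not show the actual code, so the following is only a guess at the general shape of that combination: a generic RE finds candidates, and a plain function filters them. The names `plausible` and `find_addresses`, the punctuation that gets stripped, and the filter rules are all assumptions.

```python
import re

# Generic pattern from the original post, as a raw string without groups.
CANDIDATE = re.compile(r"\S+@\S+")


def plausible(addr):
    """Assumed, deliberately simple filter; not an RFC validator."""
    if addr.count("@") != 1:
        return False
    name, host = addr.split("@")
    labels = host.split(".")
    # non-empty name, at least two host labels, no empty labels
    return bool(name) and len(labels) >= 2 and all(labels)


def find_addresses(text):
    """Find candidates with the generic RE, then filter them in code."""
    found = []
    for match in CANDIDATE.finditer(text):
        # drop punctuation that commonly surrounds an address in prose
        candidate = match.group(0).strip(".,;:!?()<>[]\"'")
        if plausible(candidate):
            found.append(candidate)
    return found


sample = "Mail rbt@example.com, not foo@bar or http://user@host; see <jim@example.org>."
print(find_addresses(sample))
```

The RE stays permissive so that it misses little, and the stricter rules live in `plausible`, where they are easier to read and adjust.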
    rbt, Jan 19, 2006
    #11