a more precise re for email addys

rbt · Jan 18, 2006

Is it possible to write an re that _only_ matches email addresses? I've
been googling around and have found several examples on the Web, but all
of them produce too many false positives... here are examples from
Google that I've experimented with:

re.compile('([\w\.\-]+@[\w\.\-]+)')
re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
re.compile('(\S+)@(\S+)')

All of these will find email addys, but they also find other things.
Could someone demonstrate how to write a more accurate re for emails?

BTW, this is not for spam, but like any tool could be used in a bad way.

Thanks!

Jim · Jan 18, 2006

There is a precise one in a Perl module, I believe.
http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html
Can you swipe that?

Jim

Todd Whiteman · Jan 18, 2006

OMG, that is so ugly

skip · Jan 18, 2006

rbt> re.compile('([\w\.\-]+@[\w\.\-]+)')
rbt> re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
rbt> re.compile('(\S+)@(\S+)')

rbt> All of these will find email addys, but they also find other
rbt> things.

I think the only way to decide if your regular expression does what you want
is to provide a set of strings it must accept and another set which it must
reject. Supply those two sets and I'm sure any number of people here can
come up with a regular expression that distinguishes the two sets.
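
A minimal sketch of that accept/reject check, assuming Python 3. The sample strings and the helper name `check_pattern` are made up for illustration; substitute strings from your own data.

```python
import re

# Hypothetical sample sets; replace them with strings from your own data.
MUST_ACCEPT = ["user@example.com", "first.last@mail.example.org"]
MUST_REJECT = ["no-at-sign.example.com", "user@", "@example.com"]

def check_pattern(pattern, must_accept, must_reject):
    """Return (missed, false_positives) for a pattern tested against both sets."""
    compiled = re.compile(pattern)
    # Strings that should match in full but do not.
    missed = [s for s in must_accept if not compiled.fullmatch(s)]
    # Strings that should be rejected but match in full.
    false_positives = [s for s in must_reject if compiled.fullmatch(s)]
    return missed, false_positives

missed, false_positives = check_pattern(r"[\w.\-]+@[\w.\-]+", MUST_ACCEPT, MUST_REJECT)
print("missed:", missed)
print("false positives:", false_positives)
```

Growing the two lists as new false positives turn up gives a cheap regression test for whatever pattern ends up being used.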

Skip

skip · Jan 18, 2006

Jim> http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html

Maybe Cafe Express could be convinced to put that on a t-shirt...

Skip

Guest · Jan 18, 2006

* rbt said:
Is it possible to write an re that _only_ matches email addresses?

No. The only way to check if the matched thing is a mail address is to send
a mail and ask the supposed receiver whether he got it.

The grammar in RFC 2822 matches nearly anything with an @ in it. So, how
accurate your regex needs to be depends heavily on the context of the
usage. For example, my suggestion for web form checkers is always to just
look for an @ char and do the rest using the human component.

nd

rbt · Jan 18, 2006

Jim said:
There is a precise one in a Perl module, I believe.
http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html
Can you swipe that?

Jim

I can swipe it... but it causes my head to explode. I get unbalanced
parentheses errors when trying to make it work as a Python re... it makes
more sense when broken up like this:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]
+(?:(?:(?:\r\n)... \000-\031]
+(?:(?:(?:\r\n)... \000-\031]
+(?:(?:(?:\r\n)... \000-\031]
....
....
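
One way to keep a long pattern readable in Python is `re.VERBOSE` together with a raw triple-quoted string, which also avoids backslash surprises when porting from Perl. The sketch below uses a deliberately simplified pattern, not the RFC 822 expression from the Perl module; the name `SIMPLE_ADDRESS` is only illustrative.

```python
import re

# A simplified pattern for illustration only, NOT the RFC 822 expression.
# With re.VERBOSE, whitespace outside character classes is ignored and
# "#" starts a comment, so each piece can sit on its own line.
SIMPLE_ADDRESS = re.compile(r"""
    [\w.\-]+            # local part: word characters, dots, hyphens
    @                   # the separator
    [\w\-]+             # first host label
    (?: \. [\w\-]+ )+   # one or more further dot-separated labels
""", re.VERBOSE)

print(bool(SIMPLE_ADDRESS.fullmatch("user@example.com")))
print(bool(SIMPLE_ADDRESS.fullmatch("user@localhost")))
```

Laying the groups out one per line also makes an unbalanced parenthesis much easier to spot.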

dave.brueck · Jan 18, 2006

Does it really need to be a regular expression? Why not just write a
short function that breaks apart the input and validates each part?

def IsEmail(addr):
    'Returns True if addr appears to be a valid email address'

    # we don't allow stuff with more than one '@'
    if addr.count('@') != 1:
        return False
    name, host = addr.split('@')

    # verify the hostname (is an IP or has a valid TLD, etc.)
    hostParts = host.split('.')
    ...

That way you'd have a nice, readable chunk of code that you could tweak
as needed (for example, maybe you'll find that the RFC is too liberal
so you'll end up needing to add additional rules to exclude "bad"
addresses).
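
One possible way to fill in the elided part, as a sketch only. The label rule, the TLD rule, and the names `LABEL` and `is_email` are assumptions made for illustration, and this version does not handle IP-address hosts.

```python
import re

# Assumed rule for a host label: letters, digits and hyphens, with no
# leading or trailing hyphen. This is an illustrative choice.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9\-]*[A-Za-z0-9])?")

def is_email(addr):
    """Return True if addr appears to be a plausible email address."""
    # Reject addresses with zero or several '@' characters.
    if addr.count("@") != 1:
        return False
    name, host = addr.split("@")

    # The local part must be non-empty and contain no whitespace.
    if not name or any(ch.isspace() for ch in name):
        return False

    # The host needs at least two dot-separated labels, each well formed.
    host_parts = host.split(".")
    if len(host_parts) < 2:
        return False
    if not all(LABEL.fullmatch(part) for part in host_parts):
        return False

    # Assumed TLD rule: alphabetic, at least two characters.
    tld = host_parts[-1]
    return tld.isalpha() and len(tld) >= 2

print(is_email("user@example.com"))
print(is_email("foo@bar@example.com"))
```

Each rule is a separate `if`, so tightening or loosening one of them does not disturb the others.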

rbt · Jan 18, 2006

dave.brueck said:
Does it really need to be a regular expression? Why not just write a
short function that breaks apart the input and validates each part?

Good idea. I'll see what I can do with this. Thanks!

Steven D'Aprano · Jan 19, 2006

skip said:
I think the only way to decide if your regular expression does what you want
is to provide a set of strings it must accept and another set which it must
reject. Supply those two sets and I'm sure any number of people here can
come up with a regular expression that distinguishes the two sets.

Doesn't the relevant RFC state that the only way to
determine a valid email address is to send to it and
see if the mail server likes it?

I believe it explicitly warns against validating email
addresses, since you will invariably end up refusing to
accept some valid email addresses.

rbt · Jan 19, 2006

dave.brueck said:
Does it really need to be a regular expression? Why not just write a
short function that breaks apart the input and validates each part?

Just to follow up on this: I found that combining something like this
with a more generic RE gives much better results. Thanks for the idea!
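
A rough sketch of that combination: a generic RE collects candidates and a small function filters them. The filter `plausible`, the helper `find_addresses`, and the sample text are hypothetical stand-ins, not the code actually used in this thread.

```python
import re

# Generic candidate finder: anything non-space around an '@'.
CANDIDATE = re.compile(r"\S+@\S+")

def plausible(addr):
    """Hypothetical filter standing in for a function like IsEmail."""
    if addr.count("@") != 1:
        return False
    name, host = addr.split("@")
    labels = host.split(".")
    # Require a local part and at least two non-empty host labels.
    return bool(name) and len(labels) >= 2 and all(labels)

def find_addresses(text):
    """Return candidates from text that pass the filter."""
    found = []
    for match in CANDIDATE.finditer(text):
        # Trim punctuation that commonly surrounds an address in running text.
        candidate = match.group().strip("<>()[],;:.\"'")
        if plausible(candidate):
            found.append(candidate)
    return found

text = "Write to jim@example.com, not to foo@bar or a@b@example.org."
print(find_addresses(text))
```

The RE stays loose on purpose; the rules that cut down false positives live in the function, where they are easier to read and adjust.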
