a more precise re for email addys

rbt

Is it possible to write an re that _only_ matches email addresses? I've
been googling around and have found several examples on the Web, but all
of them produce too many false positives... here are examples from
Google that I've experimented with:

re.compile('([\w\.\-]+@[\w\.\-]+)')
re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
re.compile('(\S+)@(\S+)')

All of these will find email addys, but they also find other things.
Could someone demonstrate how to write a more accurate re for emails?

BTW, this is not for spam, but like any tool could be used in a bad way.

Thanks!
 
skip

rbt> re.compile('([\w\.\-]+@[\w\.\-]+)')
rbt> re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
rbt> re.compile('(\S+)@(\S+)')

rbt> All of these will find email addys, but they also find other
rbt> things.

I think the only way to decide if your regular expression does what you want
is to provide a set of strings it must accept and another set which it must
reject. Supply those two sets and I'm sure any number of people here can
come up with a regular expression that distinguishes the two sets.
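
A minimal sketch of that approach, using the three patterns quoted above. The sample strings in MUST_ACCEPT and MUST_REJECT and the helper name score are made up for illustration, and the check uses fullmatch (whole-string matching) as an assumption; substitute strings from your own data:

```python
import re

# The three patterns quoted above, written as raw strings.
PATTERNS = {
    "word-chars": re.compile(r'([\w\.\-]+@[\w\.\-]+)'),
    "tld-suffix": re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}'),
    "non-space": re.compile(r'(\S+)@(\S+)'),
}

# Hypothetical sample sets; replace them with strings from your own data.
MUST_ACCEPT = ["rbt@example.com", "first.last@mail.example.org"]
MUST_REJECT = ["no-at-sign.example.com", "foo@bar@baz.com", "user@.."]

def score(pattern):
    """Return (wrongly rejected, wrongly accepted) for one compiled pattern."""
    missed = [s for s in MUST_ACCEPT if not pattern.fullmatch(s)]
    false_hits = [s for s in MUST_REJECT if pattern.fullmatch(s)]
    return missed, false_hits

for name, pattern in PATTERNS.items():
    missed, false_hits = score(pattern)
    print(name, "missed:", missed, "false positives:", false_hits)
```

With these particular sets, only the second pattern gets through cleanly; a different pair of sets could easily rank the patterns differently, which is the point of writing the sets down first.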

Skip
 
Guest

* rbt said:
Is it possible to write an re that _only_ matches email addresses?

No. The only way to check if the matched thing is a mail address is to send
a mail and ask the supposed receiver whether he got it.

The grammar in RFC 2822 nearly matches anything with an @ in it. So, how
accurate your regex needs to be depends heavily on the context of the
usage. For example, my suggestion for web form checkers is always to just
look for an @ char and do the rest using the human component.
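
As a rough sketch, that form check amounts to a single membership test (the function name looks_like_address is made up for illustration):

```python
def looks_like_address(text):
    """Loose form check: True if the text contains an '@' character."""
    return "@" in text

print(looks_like_address("nd@example.org"))
print(looks_like_address("nd.example.org"))
```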

nd
 
rbt

Jim said:
There is a precise one in a Perl module, I believe.
http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html
Can you swipe that?

Jim

I can swipe it... but it causes my head to explode. I get unbalanced
parentheses errors when trying to make it work as a Python re... it makes
more sense when broken up like this:

(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]
+(?:(?:(?:\r\n)... \000-\031]
+(?:(?:(?:\r\n)... \000-\031]
+(?:(?:(?:\r\n)... \000-\031]
....
....
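
One way to keep a long pattern readable in Python, and to make it easier to see where parentheses open and close, is the re.VERBOSE flag, which ignores whitespace in the pattern and allows comments. The sketch below uses a deliberately simplified pattern, not the RFC 822 grammar from the Perl module; the name SIMPLE_ADDRESS and the pattern itself are illustrative assumptions:

```python
import re

# A deliberately simplified pattern (NOT the RFC 822 grammar), split across
# lines with re.VERBOSE so each group can carry its own comment.
SIMPLE_ADDRESS = re.compile(r"""
    (?P<name> [\w.\-]+ )        # local part: word characters, dots, hyphens
    @
    (?P<host> [\w\-]+           # first host label
        (?: \. [\w\-]+ )+       # one or more further dot-separated labels
    )
""", re.VERBOSE)

match = SIMPLE_ADDRESS.fullmatch("rbt@mail.example.com")
print(match.group("name"), match.group("host"))
```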
 
dave.brueck

Does it really need to be a regular expression? Why not just write a
short function that breaks apart the input and validates each part?

def IsEmail(addr):
    'Returns True if addr appears to be a valid email address'

    # we don't allow stuff with more than one '@' in it
    if addr.count('@') != 1:
        return False
    name, host = addr.split('@')

    # verify the hostname (is an IP or has a valid TLD, etc.)
    hostParts = host.split('.')
    ...

That way you'd have a nice, readable chunk of code that you could tweak
as needed (for example, maybe you'll find that the RFC is too liberal
so you'll end up needing to add additional rules to exclude "bad"
addresses).
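
For illustration, here is one way the elided part might be filled in. This is only a sketch under stated assumptions: the rules below (non-empty name, host is either an IP literal or at least two non-empty dot-separated labels ending in an alphabetic label) are made up for the example and are not taken from any RFC, and the name is_email is a hypothetical stand-in:

```python
import ipaddress

def is_email(addr):
    """Return True if addr appears to be a plausible email address."""
    # Exactly one '@', with something on both sides.
    if addr.count('@') != 1:
        return False
    name, host = addr.split('@')
    if not name or not host:
        return False

    # Assumption: accept a bare IPv4/IPv6 literal as the host.
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        pass

    # Assumption: otherwise require at least two non-empty labels,
    # the last of which is purely alphabetic (a TLD-like suffix).
    host_parts = host.split('.')
    if len(host_parts) < 2 or not all(host_parts):
        return False
    return host_parts[-1].isalpha()

for candidate in ["rbt@example.com", "foo@bar@example.com", "rbt@example"]:
    print(candidate, is_email(candidate))
```

Each rule sits on its own few lines, so loosening or tightening one of them does not disturb the rest.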
 
rbt

Good idea. I'll see what I can do with this. Thanks!
 
Steven D'Aprano

skip said:
I think the only way to decide if your regular expression does what you want
is to provide a set of strings it must accept and another set which it must
reject. Supply those two sets and I'm sure any number of people here can
come up with a regular expression that distinguishes the two sets.

Doesn't the relevant RFC state that the only way to
determine a valid email address is to send to it and
see if the mail server likes it?

I believe it explicitly warns against validating email
addresses, since you will invariably end up refusing to
accept some valid email addresses.
 
rbt

dave.brueck said:
Does it really need to be a regular expression? Why not just write a
short function that breaks apart the input and validates each part?

Just to follow up on this: I found that combining something like this
with a more generic RE gives much better results. Thanks for the idea!
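
A rough sketch of that combination, not the code actually used in the thread: a deliberately loose RE finds candidates, and a small function then filters them. The names CANDIDATE, plausible and find_addresses, the filter rules, and the sample text are all assumptions made for illustration:

```python
import re

# Deliberately loose pattern: any run of non-space characters around an '@'.
CANDIDATE = re.compile(r'\S+@\S+')

def plausible(addr):
    """Hypothetical filter: one '@', a non-empty name, and a dotted host."""
    if addr.count('@') != 1:
        return False
    name, host = addr.split('@')
    labels = host.split('.')
    return bool(name) and len(labels) >= 2 and all(labels)

def find_addresses(text):
    """Find candidates with the loose pattern, then keep those that pass the filter."""
    # Strip surrounding punctuation that the loose pattern tends to drag in.
    candidates = (m.group().strip('.,;:<>()[]"\'') for m in CANDIDATE.finditer(text))
    return [c for c in candidates if plausible(c)]

sample = "Mail rbt@example.com, not foo@bar or a@b@example.org."
print(find_addresses(sample))
```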
 
