I need some help with a regexp please

codefire

Hi,

I am trying to get a regexp to validate email addresses but can't get
it quite right. The problem is I can't quite find the regexp to deal
with ignoring the case (e-mail address removed), which is not valid. Here's
my attempt, neither of my regexps work quite how I want:

Code:
import os
import re

s = ('Hi [email protected] [email protected] [email protected] @@not '
     '[email protected] partridge in a pear tree')
r = re.compile(r'\w+\.?\w+@[^@\s]+\.\w+')
#r = re.compile(r'[a-z\-\.]+@[a-z\-\.]+')

addys = set()
for a in r.findall(s):
    addys.add(a)

for a in sorted(addys):
    print(a)

This gives:
(e-mail address removed)
(e-mail address removed)
(e-mail address removed) <-- shouldn't be here :(
(e-mail address removed)

Nearly there but no cigar :)

I can't see the wood for the trees now :) Can anyone suggest a fix
please?

Thanks,
Tony
 
richard.charts

codefire said:
Hi,

I am trying to get a regexp to validate email addresses but can't get
it quite right. The problem is I can't quite find the regexp to deal
with ignoring the case (e-mail address removed), which is not valid. Here's
my attempt, neither of my regexps work quite how I want:

Code:
import os
import re

s = ('Hi [email protected] [email protected] [email protected] @@not '
     '[email protected] partridge in a pear tree')
r = re.compile(r'\w+\.?\w+@[^@\s]+\.\w+')
#r = re.compile(r'[a-z\-\.]+@[a-z\-\.]+')

addys = set()
for a in r.findall(s):
    addys.add(a)

for a in sorted(addys):
    print(a)

This gives:
(e-mail address removed)
(e-mail address removed)
(e-mail address removed) <-- shouldn't be here :(
(e-mail address removed)

Nearly there but no cigar :)

I can't see the wood for the trees now :) Can anyone suggest a fix
please?

Thanks,
Tony

'[\w.]+@\w+(\.\w+)*'
Works for me, and SHOULD for you, but I haven't tested it all that
much.
Good luck.
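
One thing to watch when dropping this pattern into the findall loop from the original post: if a pattern contains a capturing group, re.findall returns the text of the group rather than the whole match, so you would get fragments like '.com' back. Below is a minimal sketch of the difference, using a non-capturing group (?:...) instead. The sample addresses are made up for illustration, since the real ones were removed from the thread.

```python
import re

# Hypothetical sample text; these addresses are invented for the example.
text = "Hi alice@example.com bob.smith@mail.example.org @@not"

# Pattern as suggested above, with a capturing group.
capturing = re.compile(r'[\w.]+@\w+(\.\w+)*')

# Same pattern, but the group is non-capturing.
non_capturing = re.compile(r'[\w.]+@\w+(?:\.\w+)*')

# With one capturing group, findall returns only that group's last repetition.
group_only = capturing.findall(text)

# Without capturing groups, findall returns the full matched text.
full_matches = non_capturing.findall(text)

print(group_only)
print(full_matches)
```

Using r.finditer(s) and calling m.group(0) on each match would be another way to get the whole address without changing the pattern.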
 
Neil Cerutti

I am trying to get a regexp to validate email addresses but
can't get it quite right. The problem is I can't quite find the
regexp to deal with ignoring the case (e-mail address removed),
which is not valid. Here's my attempt, neither of my regexps
work quite how I want:

I suggest a websearch for email address validators instead of
writing of your own.

Here's a hit that looks useful:

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/66439
 
Steve Holden

codefire said:
Hi,

I am trying to get a regexp to validate email addresses but can't get
it quite right. The problem is I can't quite find the regexp to deal
with ignoring the case (e-mail address removed), which is not valid. Here's
my attempt, neither of my regexps work quite how I want:

Code:
import os
import re

s = ('Hi [email protected] [email protected] [email protected] @@not '
     '[email protected] partridge in a pear tree')
r = re.compile(r'\w+\.?\w+@[^@\s]+\.\w+')
#r = re.compile(r'[a-z\-\.]+@[a-z\-\.]+')

addys = set()
for a in r.findall(s):
    addys.add(a)

for a in sorted(addys):
    print(a)

This gives:
(e-mail address removed)
(e-mail address removed)
(e-mail address removed) <-- shouldn't be here :(
(e-mail address removed)

Nearly there but no cigar :)

I can't see the wood for the trees now :) Can anyone suggest a fix
please?

The problem is that your pattern doesn't start out by confirming that
it's either at the start of a line or after whitespace. You could do
this with a "look-behind assertion" if you wanted.

regards
Steve
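
A minimal sketch of the look-behind idea, assuming (?<!\S) is an acceptable way to say "start of the string or just after whitespace". The sample text is hypothetical; "tony..bob@example.com" stands in for the invalid double-dot address, since the real addresses were removed from the thread.

```python
import re

# Hypothetical sample text, invented for this example.
text = "Hi tony@example.com tony..bob@example.com end"

# The original pattern: it is free to start matching in the middle of a token.
plain = re.compile(r'\w+\.?\w+@[^@\s]+\.\w+')

# (?<!\S) is a negative look-behind: the character before the match must not
# be a non-whitespace character, so the match has to begin at the start of
# the string or right after whitespace. It consumes no characters itself.
anchored = re.compile(r'(?<!\S)\w+\.?\w+@[^@\s]+\.\w+')

plain_matches = plain.findall(text)
anchored_matches = anchored.findall(text)

print(plain_matches)     # includes the tail of the double-dot address
print(anchored_matches)  # only the address that starts a token
```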
 
codefire

Hi,

thanks for the advice guys.

Well took the kids swimming, watched some TV, read your hints and
within a few minutes had this:

r = re.compile(r'[^.\w]\w+\.?\w+@[^@\s]+\.\w+')

This works for me. That is if you have an invalid email such as
tony..bATblah.com it will reject it (note the double dots).

Anyway, now know a little more about regexps :)

Thanks again for the hints,

Tony
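
Two side effects of the leading [^.\w] are worth knowing about: it consumes the character in front of the address, so that character ends up in the match, and it needs a character to consume, so an address at the very start of the string is missed. A small sketch with made-up addresses (the originals were removed from the thread):

```python
import re

# The pattern from the post above.
pattern = re.compile(r'[^.\w]\w+\.?\w+@[^@\s]+\.\w+')

# Hypothetical sample text: one address at the very start, one after a space.
text = "tony@example.com and bob@example.com"

# The first address has no preceding character for [^.\w] to match, so it is
# skipped; the second is found, but with its leading space included.
matches = pattern.findall(text)

# One possible workaround: drop the consumed leading character.
cleaned = [m[1:] for m in matches]

print(matches)
print(cleaned)
```

A look-behind assertion avoids both effects because it checks the preceding character without consuming it.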
 
John Machin

codefire said:
Hi,

thanks for the advice guys.

Well took the kids swimming, watched some TV, read your hints and
within a few minutes had this:

r = re.compile(r'[^.\w]\w+\.?\w+@[^@\s]+\.\w+')

This works for me. That is if you have an invalid email such as
tony..bATblah.com it will reject it (note the double dots).

Anyway, now know a little more about regexps :)

A little more is unfortunately not enough. The best advice you got was
to use an existing e-mail address validator. The definition of a valid
e-mail address is complicated. You may care to check out "Mastering
Regular Expressions" by Jeffrey Friedl. In the first edition, at least
(I haven't looked at the 2nd), he works through assembling a 4700+ byte
regex for validating e-mail addresses. Yes, that's 4KB. It's the best
advertisement for *not* using regexes for a task like that that I've
ever seen.

Cheers,
John
 
Ben Finney

John Machin said:
A little more is unfortunately not enough. The best advice you got was
to use an existing e-mail address validator. The definition of a valid
e-mail address is complicated. You may care to check out "Mastering
Regular Expressions" by Jeffrey Friedl. In the first edition, at least
(I haven't looked at the 2nd), he works through assembling a 4700+ byte
regex for validating e-mail addresses. Yes, that's 4KB. It's the best
advertisement for *not* using regexes for a task like that that I've
ever seen.

The best advice I've seen when people ask "How do I validate whether
an email address is valid?" was "Try sending mail to it".

It's both Pythonic, and truly the best way. If you actually want to
confirm, don't try to validate it statically; *use* the email address,
and check the result. Send an email to that address, and don't use it
any further unless you get a reply saying "yes, this is the right
address to use" from the recipient.

The sending system's mail transport agent, not regular expressions,
determines which part is the domain to send the mail to.

The domain name system, not regular expressions, determines what
domains are valid, and what host should receive mail for that domain.

Most especially, the receiving mail system, not regular expressions,
determines what local-parts are valid.
 
Steve Holden

Ben said:
The best advice I've seen when people ask "How do I validate whether
an email address is valid?" was "Try sending mail to it".

That only applies if it's a likely-looking email address. If someone
asks me to send mail to "splurge.!#$%*&^from@thingie?><{}_)" I will
probably assume that it isn't worth my time trying.

If the email looks syntactically correct, *then* it's worth further
validation by trying a delivery attempt.

It's both Pythonic, and truly the best way. If you actually want to
confirm, don't try to validate it statically; *use* the email address,
and check the result. Send an email to that address, and don't use it
any further unless you get a reply saying "yes, this is the right
address to use" from the recipient.

This is a rather scatter-shot approach. Many possibilities can be
properly eliminated by judicious lexical checks before delivery is
considered.

The sending system's mail transport agent, not regular expressions,
determines which part is the domain to send the mail to.

The domain name system, not regular expressions, determines what
domains are valid, and what host should receive mail for that domain.

Most especially, the receiving mail system, not regular expressions,
determines what local-parts are valid.

Nevertheless, I am *not* going to try delivery to (for example) a
non-local address that doesn't contain an "at" sign.

regards
Steve
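
A minimal sketch of the kind of cheap lexical check meant here, and nothing more than that: it only asks whether there is an "at" sign with something on both sides. The function name worth_trying is a hypothetical helper, not from any library, and the addresses are made up.

```python
def worth_trying(address):
    """Cheap lexical pre-check before any delivery attempt (hypothetical helper)."""
    # Split on the LAST "@": a quoted local part may itself contain "@",
    # while the domain part cannot.
    local_part, at_sign, domain = address.rpartition('@')
    # Require the separator plus a non-empty local part and domain.
    return bool(at_sign and local_part and domain)


print(worth_trying('tony@example.com'))
print(worth_trying('no-at-sign'))
```

Everything that passes still needs a real delivery attempt; the check only weeds out input that cannot be a non-local address at all.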
 
Ben Finney

Steve Holden said:
That only applies if it's a likely-looking email address. If someone
asks me to send mail to "splurge.!#$%*&^from@thingie?><{}_)" I will
probably assume that it isn't worth my time trying.

You, as a human, can possibly make that decision, if you don't care
about turning away someone who *does* have such an email address. How
can an algorithm do so? There are many valid email addresses that look
as bizarre as the example you gave.

Nevertheless, I am *not* going to try delivery to (for example) a
non-local address that doesn't contain an "at" sign.

Would you try delivery to an email address that contains two or more
"@" symbols? If not, you will be denying delivery to valid RFC2821
addresses.

This is, of course, something you're entitled to do. But you've then
consciously chosen not to use "is the email address valid?" as your
criterion, and the original request for such validation becomes moot.
 
Ant

John Machin wrote:
....
A little more is unfortunately not enough. The best advice you got was
to use an existing e-mail address validator.

We got bitten by this at the last place I worked - we were using a
regex email validator (from Microsoft IIRC), and we kept having
problems with specific email addresses from Ireland. There are stacks of
Irish email addresses out there of the form paddy.o'reilly@domain -
perfectly valid email address, but doesn't satisfy the usual naive
versions of regex validators.

We use an even worse validator at my current job, but the feeling the
management have (not one I agree with) is that unusual email addresses,
whilst perhaps valid, are uncommon enough not to worry about....
 
Ant

Ben Finney wrote:
....
The best advice I've seen when people ask "How do I validate whether
an email address is valid?" was "Try sending mail to it".

There are advantages to the regex method. It is faster than sending an
email and getting a positive or negative return code. The delay may not
be acceptable in many applications. Secondly, the false negatives found
by a reasonable regex will be few compared to the number you'd get if
the smtp server went down, or a remote relay was having problems
delivering the message etc etc.
From a business point of view, it is probably more important to reduce
the number of false negatives than to reduce the number of false
positives - every false negative is a potential loss of a customer.
False positives? Who cares really as long as they are paying ;-)
 
John Machin

Ben said:
You, as a human, can possibly make that decision, if you don't care
about turning away someone who *does* have such an email address. How
can an algorithm do so? There are many valid email addresses that look
as bizarre as the example you gave.


Would you try delivery to an email address that contains two or more
"@" symbols? If not, you will be denying delivery to valid RFC2821
addresses.

This is, of course, something you're entitled to do. But you've then
consciously chosen not to use "is the email address valid?" as your
criterion, and the original request for such validation becomes moot.

What proportion of deliverable e-mail addresses have more than one @ in
them?
It may be a good idea, if the supplier of the e-mail address is a human
and is on-line, to run a plausibility check -- does it look like the
vast majority of addresses? Sure,
"(e-mail address removed)@[email protected]" may be valid and deliverable,
but "(e-mail address removed)@pastetwice.unorg" may be valid and
undeliverable. IMHO a quick "Please check and confirm" dialogue would
be warranted.

Cheers,
John
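
A rough sketch of such a plausibility check, under the assumption that "looks like the vast majority of addresses" can be approximated by a deliberately conservative pattern. Nothing is rejected; unusual input only triggers a "please check and confirm" step. COMMON_SHAPE and needs_confirmation are hypothetical names, and the addresses are made up.

```python
import re

# Conservative pattern for "looks like most addresses". This is an assumption
# for illustration, NOT a test of validity: valid addresses can fail it.
COMMON_SHAPE = re.compile(r"[\w.+'-]+@[\w-]+(?:\.[\w-]+)+")


def needs_confirmation(address):
    """Return True if the user should be asked to double-check the address."""
    # fullmatch requires the whole string to fit the common shape.
    return COMMON_SHAPE.fullmatch(address) is None


for candidate in ['tony@example.com', "paddy.o'reilly@example.ie",
                  '"a@b"@example.com']:
    if needs_confirmation(candidate):
        print('Please check and confirm:', candidate)
    else:
        print('Looks ordinary:', candidate)
```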
 
John Machin

Ant said:
John Machin wrote:
...

We got bitten by this at the last place I worked - we were using a
regex email validator (from Microsoft IIRC), and we kept having
problems with specific email addresses from Ireland. There are stacks of
Irish email addresses out there of the form paddy.o'reilly@domain -
perfectly valid email address, but doesn't satisfy the usual naive
versions of regex validators.

We use an even worse validator at my current job, but the feeling the
management have (not one I agree with) is that unusual email addresses,
whilst perhaps valid, are uncommon enough not to worry about....

Oh, sorry for the abbreviation. "use" implies "source from believedly
reliable s/w source; test; then deploy" :)
 
Ben Finney

John Machin said:
What proportion of deliverable e-mail addresses have more than one @
in them?

I don't know. Fortunately, I don't need to; I don't "validate" email
addresses by regular expression.

What proportion of deliverable email addresses do you want to discard
as "not valid"?
 
John Machin

Ben said:
I don't know. Fortunately, I don't need to; I don't "validate" email
addresses by regular expression.

What proportion of deliverable email addresses do you want to discard
as "not valid"?

None. Re-read my post. I was suggesting an "are you sure" in
the case of weird or infrequent ones. Discarding wasn't mentioned.
 
Steve Holden

Ben said:
I don't know. Fortunately, I don't need to; I don't "validate" email
addresses by regular expression.

What proportion of deliverable email addresses do you want to discard
as "not valid"?

Just as a matter of interest, are you expecting that you'll find out
about the undeliverable ones? Because in many cases nowadays you won't,
since so many domains are filtering out "undeliverable mail" messages as
an anti-spam defence.

regards
Steve
 
Ant

John said:
....
Oh, sorry for the abbreviation. "use" implies "source from believedly
reliable s/w source; test; then deploy" :)

I actually meant that we got bitten by using a regex validator, not by
using an existing one. Though we did get bitten by an existing one, and
it being from Microsoft we should have known better ;-)
 
Tim Williams

Just as a matter of interest, are you expecting that you'll find out
about the undeliverable ones? Because in many cases nowadays you won't,
since so many domains are filtering out "undeliverable mail" messages as
an anti-spam defence.

....and then there is the problem of validating that the valid email
address belongs to the person entering it !! If it doesn't, any
correspondence you send to that email address will itself be spam (in
the greater modern definition of spam).

You could allow your form to accept any email address, then send a
verification in an email to the address given, asking the recipient
to click a link if they did in fact fill in the form. When they click
the link the details from the original form are then verified and can
be activated and processed.

HTH :)
 
Ben Finney

Steve Holden said:
Just as a matter of interest, are you expecting that you'll find out
about the undeliverable ones? Because in many cases nowadays you
won't, since so many domains are filtering out "undeliverable mail"
messages as an anti-spam defence.

I wouldn't expect a program to treat a user-supplied email address as
known-good until receiving a confirmation email with a cookie, or some
out-of-band confirmation (e.g., the email addresses are seeded by some
trusted source).

Until then, it's an untrusted piece of user-supplied data, to be kept
around for a limited time pending confirmation, and then discarded.
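
The approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the in-memory dict, the 24-hour window, and the function names request_confirmation and confirm are all hypothetical, and actually mailing the token to the address is left out.

```python
import secrets
import time

# Assumed retention window for unconfirmed addresses (hypothetical value).
PENDING_TTL_SECONDS = 24 * 60 * 60

pending = {}       # token -> (address, time the request was made)
confirmed = set()  # addresses whose owner has replied with the token


def request_confirmation(address, now=None):
    """Store the address as untrusted and return the token to mail out."""
    now = time.time() if now is None else now
    token = secrets.token_urlsafe(16)  # hard-to-guess cookie
    pending[token] = (address, now)
    return token


def confirm(token, now=None):
    """Mark the address as confirmed if the token is known and not expired."""
    now = time.time() if now is None else now
    entry = pending.pop(token, None)  # a token can be used at most once
    if entry is None:
        return False
    address, requested_at = entry
    if now - requested_at > PENDING_TTL_SECONDS:
        return False  # too late: the pending record is simply discarded
    confirmed.add(address)
    return True


token = request_confirmation('tony@example.com')
print(confirm(token))
```

A real system would also need a periodic sweep to drop expired entries that nobody ever tries to confirm.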
 
codefire

Yes, I didn't make it clear in my original post - the purpose of the
code was to learn something about regexps (I only started coding Python
last week). In terms of learning "a little more" the example was
successful. However, creating a full email validator is way beyond me -
the rules are far too complex!! :)
 
