Parsing for email addresses

galileo228

Hey all,

I'm trying to write python code that will open a textfile and find the
email addresses inside it. I then want the code to take just the
characters to the left of the "@" symbol, and place them in a list.
(So if (e-mail address removed) was in the file, 'galileo228' would be
added to the list.)

Any suggestions would be much appreciated!

Matt
 
Jonathan Gardner

galileo228 said:
I'm trying to write python code that will open a textfile and find the
email addresses inside it. I then want the code to take just the
characters to the left of the "@" symbol, and place them in a list.
(So if (e-mail address removed) was in the file, 'galileo228' would be
added to the list.)

Any suggestions would be much appreciated!

You may want to use regexes for this. For every match, split on '@'
and take the first bit.

Note that the actual specification for email addresses is far more
than a single regex can handle. However, for almost every single case
out there nowadays, a regex will get what you need.
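
A minimal sketch of that suggestion, written for Python 3. The pattern,
the sample text and the helper name local_parts are assumptions made for
illustration; the pattern is deliberately loose and not RFC-compliant.

```python
import re

# Loose, non-RFC pattern (an assumption for this sketch): something@host.tld
EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+(?:\.[\w-]+)+')

def local_parts(text):
    """Return the part left of '@' for every address-like match in text."""
    return [match.split('@')[0] for match in EMAIL_RE.findall(text)]

sample = "Contact alice@example.com or bob.smith@mail.example.org today."
print(local_parts(sample))
```

Splitting each match on '@' and keeping index 0 is the "take the first
bit" step; everything else is just finding the candidates.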
 
Tim Chase

Jonathan said:
You may want to use regexes for this. For every match, split on '@'
and take the first bit.

Note that the actual specification for email addresses is far more
than a single regex can handle. However, for almost every single case
out there nowadays, a regex will get what you need.

You can even capture the local part (the bit before the "@") in the
same regex that finds the addresses. As Jonathan mentions, finding
RFC-compliant email addresses can be a hairy/intractable problem,
but you can get a pretty close approximation:

import re

r = re.compile(r'([-\w._+]+)@(?:[-\w]+\.)+(?:\w{2,5})', re.I)
#                                        ^
# if you want to allow local domains like
# user@localhost
# then change the "+" marked with the "^"
# to a "*" and the "{2,5}" to "+" to unlimit
# the TLD. This will change the outcome
# of the last test "jim@com" to True

for test, expected in (
        ('(e-mail address removed)', True),
        ('(e-mail address removed)', True),
        ('@example.com', False),
        ('@sub.example.com', False),
        ('@com', False),
        ('jim@com', False),
        ):
    m = r.match(test)
    if bool(m) ^ expected:
        print "Failed: %r should be %s" % (test, expected)

emails = set()
for line in file('test.txt'):
    for match in r.finditer(line):
        emails.add(match.group(1))
print "All the emails:",
print ', '.join(emails)
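
To illustrate the change described in the comment above, here is a small
Python 3 sketch comparing the pattern as posted with the relaxed variant.
The names strict and relaxed and the sample addresses are assumptions for
this example only.

```python
import re

# Pattern as posted above.
strict = re.compile(r'([-\w._+]+)@(?:[-\w]+\.)+(?:\w{2,5})', re.I)
# Variant from the comment: the marked "+" becomes "*", "{2,5}" becomes "+".
relaxed = re.compile(r'([-\w._+]+)@(?:[-\w]+\.)*(?:\w+)', re.I)

for address in ('jim@com', 'user@localhost', 'jim@example.com'):
    # Print whether each pattern accepts the address.
    print(address, bool(strict.match(address)), bool(relaxed.match(address)))
```

Only the relaxed variant accepts dotless hosts such as "localhost"; both
still reject an address with nothing before the "@".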

-tkc
 
galileo228

Hey all, thanks as always for the quick responses.

I actually found a very simple way to do what I needed to do. In
short, I needed to take an email which had a large number of addresses
in the 'to' field, and place just the identifiers (everything to the
left of @domain.com), in a python list.

I simply highlighted all the addresses and placed them in a text file
called emails.txt. Then I had the following code which placed each
line in the file into the list 'names':

Code:
fileHandle = open('/Users/Matt/Documents/python/results.txt','r')
names = fileHandle.readlines()

Now, the 'names' list has values looking like this:
['(e-mail address removed)\n', '(e-mail address removed)\n', etc]. So I ran the following code:

Code:
st_list = []
for x in names:
    st_list.append(x.replace('@domain.com\n',''))

And that did the trick! 'st_list' now has ['aaa12', 'bbb34', etc].

Obviously this only worked because all of the domain names were the
same. If they were not then based on your comments and my own
research, I would've had to use regex and the split(), which looked
massively complicated to learn.

Thanks all.

Matt
 
Tim Chase

galileo228 said:
Code:
fileHandle = open('/Users/Matt/Documents/python/results.txt','r')
names = fileHandle.readlines()

Now, the 'names' list has values looking like this:
['(e-mail address removed)\n', '(e-mail address removed)\n', etc]. So I ran the following code:

Code:
st_list = []
for x in names:
    st_list.append(x.replace('@domain.com\n',''))

And that did the trick! 'st_list' now has ['aaa12', 'bbb34', etc].

Obviously this only worked because all of the domain names were the
same. If they were not then based on your comments and my own
research, I would've had to use regex and the split(), which looked
massively complicated to learn.

The complexities stemmed from several factors that, with more
details, could have made the solutions less daunting:

(a) you mentioned "finding" the email addresses -- this makes
it sound like there's other junk in the file that has to be
sifted through to find "things that look like an email address".
If the sole content of the file is lines containing only email
addresses, then "find the email address" is a bit like [1]

(b) you omitted the detail that the domains are all the same.
Even if they're not the same, (a) reduces the problem to a much
easier task:

s = set()
for line in file('results.txt'):
    s.add(line.rsplit('@', 1)[0].lower())
print s
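
For anyone on Python 3 (where file() no longer exists), the same idea
might look roughly like this. The io.StringIO object stands in for the
real file and the sample addresses are made up; with a file on disk you
would iterate over open('results.txt') instead.

```python
import io

# Stand-in for the file on disk (hypothetical contents).
fake_file = io.StringIO(
    "Alice@example.com\nbob@other.example.org\n\nALICE@example.net\n"
)

local_parts = set()
for line in fake_file:
    line = line.strip()          # drop the trailing newline and stray spaces
    if '@' not in line:          # skip blank or malformed lines
        continue
    # Split once from the right and keep everything before the last '@'.
    local_parts.add(line.rsplit('@', 1)[0].lower())

print(sorted(local_parts))
```

Because the split happens on the "@" itself, it does not matter whether
the domains are all the same or all different.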

If it was previously a CSV or tab-delimited file, Python offers
batteries-included processing to make it easy:

import csv
f = file('results.txt', 'rb')
r = csv.DictReader(f) # CSV
# r = csv.DictReader(f, delimiter='\t') # tab delim
s = set()
for row in r:
    s.add(row['Email'].lower())
f.close()

or even

f = file(...)
r = csv.DictReader(...)
s = set(row['Email'].lower() for row in r)
f.close()
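
A rough Python 3 equivalent of the csv approach, as a sketch only. The
in-memory data, the 'Name' column and the variable names are assumptions;
the 'Email' header comes from the snippets above. With a real file you
would pass open('results.txt', newline='') to csv.DictReader.

```python
import csv
import io

# Stand-in for a CSV export; the rows and the 'Name' column are made up.
fake_csv = io.StringIO(
    "Name,Email\nAlice,Alice@example.com\nBob,bob@example.org\n"
)

reader = csv.DictReader(fake_csv)   # first row supplies the column names
# Collect the lowercased 'Email' column into a set to drop duplicates.
emails = {row['Email'].lower() for row in reader}
print(sorted(emails))
```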

Hope this gives you more ideas to work with.

-tkc

[1]
http://jacksmix.files.wordpress.com/2007/05/findx.jpg
 
galileo228

Tim -

Thanks for this. I actually did intend to have to sift through other
junk in the file, but then figured I could just cut and paste emails
directly from the 'to' field, thus making life easier.

Also, in this particular instance, the domain names were the same, and
thus I was able to figure out my solution, but I do need to know how
to handle the same situation when the domain names are different, so
your response was most helpful.

Apologies for leaving out some details.

Matt

 
