Newbie here... getting a count of repeated instances in a list.

Amy G

I started trying to learn python today. The program I am trying to write
will open a text file containing email addresses and store them in a list.
Then it will go through them saving only the domain portion of the email.
After that it will count the number of times the domain occurs, and if above
a certain threshold, it will add that domain to a list or text file, or
whatever. For now I just have it printing to the screen.

This is my code, and it works and does what I want. But I want to do
something with hash object to make this go a whole lot faster. Any
suggestions are appreciated a great deal.

Thanks,
Amy

ps. Sorry about the long post. Just really need some help here.


CODE
************************
import sys

file = open(sys.argv[1], 'r')   # Opens up file containing emails
mail_list = file.readlines()    # and sets the contents into a list
file.close()

def get_domains(email_list):
    # This function takes a list of emails and returns the domains only
    domain_list = email_list
    line_count = 0
    while line_count < len(email_list):
        # Keep the part after the first '@' and drop the trailing newline
        domain_list[line_count] = email_list[line_count].split('@', 1)[1].strip()
        line_count = line_count + 1
    return domain_list

def count_domains(domain_list):
    # Takes a list of domains and returns a list of the domains
    # that occur more than <threshold> number of times
    counted_domains = []
    line_count = 0
    domain_count = 0
    threshold = 10
    while line_count < len(domain_list):
        d = domain_list[line_count]
        domain_count = domain_list.count(d)
        if domain_count > threshold:
            r = 0
            counted_domains.append(d)
            while r < (domain_count - 1):
                # Remove all other instances of a domain once counted
                domain_list.remove(d)
                r = r + 1
        else:
            # Only advance when nothing was removed, so no entry is skipped
            line_count = line_count + 1
    return counted_domains


domains = get_domains(mail_list)
counted = count_domains(domains)
print counted

********************************************
 
Peter Otten

Amy said:
I started trying to learn python today.

Welcome to the worst programming language ... except all others :)

Amy said:
The program I am trying to write
will open a text file containing email addresses and store them in a list.
Then it will go through them saving only the domain portion of the email.
After that it will count the number of times the domain occurs, and if
above a certain threshold, it will add that domain to a list or text
file, or whatever. For now I just have it printing to the screen.

This is my code, and it works and does what I want. But I want to do
something with hash object to make this go a whole lot faster. Any
suggestions are appreciated a great deal.

I think your code looks alright, just not very idiomatic, as one might
expect. I have tinkered with it a bit and came up with the version given
below. I hope the result is readable enough, so I have abused the comments
to give some hints regarding the language/library that you might find
useful. get_domains() creates a dictionary with domains as keys and the
number of occurrences as values, e.g.

{"nowhere.com": 10, "elsewhere.edu": 5}

The code for extracting the domain from a line is factored out in its own
function extract_domain(). Precautions have been taken to make the file
usable both as a library module and a stand-alone script.


import sys

def get_domains(lines):
    "Generate a domain->frequency dict from lines"
    domains = {}
    # enumerate() is the pythonic equivalent for
    # index = 0
    # while index < len(alist):
    #     alist[index]
    #     index += 1
    for lineno, line in enumerate(lines):
        try:
            domain = extract_domain(line)
        except ValueError:
            print >> sys.stderr, "IGNORING line %d: %r" % (lineno+1, line.strip())
        else:
            # else in a try ... except ... else statement
            # may look a bit strange at first, but is
            # really useful
            domains[domain] = domains.get(domain, 0) + 1
    return domains

def extract_domain(line):
    "Remove the name part of an email address"
    try:
        return line.split("@", 1)[1].strip()
    except IndexError:
        # in this short example, you could just catch
        # the index error in get_domains; however, in the long run
        # it pays for client code to always see the "right" exception
        raise ValueError("Invalid email address format: %r" % line.strip())

def filter_domains(domains, threshold=10):
    "The <threshold> most frequent domains in alphabetical order"
    # below is a demo of a very popular construct
    # called "list comprehension"
    result = [domain for domain, freq in domains.iteritems() if freq >= threshold]
    # the list.sort() method returns None, so
    # sorting may look a bit clumsy when
    # you first encounter it
    result.sort()
    return result

# The __name__ == "__main__" test is a common idiom in Python.
# The code below is only executed if you run the script from the
# command line, but not if you import it into another module,
# so the functions above can be reused in other contexts.
if __name__ == "__main__":
    # for proper handling of command line args, have a look at
    # the optparse module
    threshold = int(sys.argv[2])
    # the file object is iterable, so in many
    # cases you can avoid an intermediate list
    # of the lines in a file
    source = file(sys.argv[1])
    try:
        domain_histogram = get_domains(source)
    finally:
        # clean up behind you if something goes wrong in the
        # try block
        source.close()

    print "domains with %d or more messages" % threshold
    print "\n".join(filter_domains(domain_histogram, threshold))
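
The code in this thread targets Python 2 (print statements, dict.iteritems(), file()). For readers on Python 3, the same domain->frequency idea can be sketched with collections.Counter. This is only a minimal sketch, not Peter's code: the sample addresses are made up, and silently skipping malformed lines is an assumption (the version above reports them on stderr).

```python
from collections import Counter

def extract_domain(line):
    """Return the part after the first '@', or raise ValueError."""
    name, sep, domain = line.strip().partition("@")
    if not sep:
        raise ValueError("Invalid email address format: %r" % line.strip())
    return domain

def get_domains(lines):
    """Build a domain -> frequency mapping, skipping malformed lines."""
    counts = Counter()
    for line in lines:
        try:
            counts[extract_domain(line)] += 1
        except ValueError:
            continue  # assumption: ignore bad lines instead of reporting them
    return counts

def filter_domains(counts, threshold=10):
    """Domains with threshold or more occurrences, sorted alphabetically."""
    return sorted(d for d, freq in counts.items() if freq >= threshold)

# Made-up sample data, for illustration only
sample = ["amy@nowhere.com\n", "bob@nowhere.com\n",
          "eve@elsewhere.edu\n", "not-an-address\n"]
histogram = get_domains(sample)
frequent = filter_domains(histogram, threshold=2)
print(frequent)
```

Like the plain dictionary above, a Counter is backed by a hash table, so each lookup and increment takes roughly constant time instead of rescanning the whole list.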


Peter
 
Peter Otten

Peter said:
def filter_domains(domains, threshold=10):
    "The <threshold> most frequent domains in alphabetical order"

Oops, above is an involuntary example of a bad docstring :-(
It should rather be

"Domains with threshold or more occurrences in alphabetical (case-sensitive) order"


Peter
 
Amy G

Thanks again for that code help. I was able to follow your comments to
understand what you were doing. However, I was wondering how I can print
out the number of instances from the dictionary.

I might like to know not only that they are over the threshold, but what
their actual count is.

Thanks,
AMY
 
Peter Otten

Amy said:
I might like to know not only that they are over the threshold, but what
their actual count is.

Generate a list of (domain, frequency) tuples:

def filter_domains2(domains, threshold=10):
    result = [(d, f) for d, f in domains.iteritems() if f >= threshold]
    result.sort()
    return result

And use it like so:

for dom, freq in filter_domains2(domain_histogram, threshold):
    print dom, "->", freq
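
If the most frequent domains should come first, the (domain, frequency) pairs can be sorted by count instead of by name. The following is a minimal Python 3 sketch of that variation; filter_by_frequency is a hypothetical name and the histogram is made-up sample data.

```python
def filter_by_frequency(domains, threshold=10):
    """(domain, frequency) pairs at or above threshold, most frequent first."""
    pairs = [(d, f) for d, f in domains.items() if f >= threshold]
    # Sort by descending frequency, then alphabetically to break ties
    pairs.sort(key=lambda pair: (-pair[1], pair[0]))
    return pairs

# Made-up sample data, for illustration only
domain_histogram = {"nowhere.com": 12, "elsewhere.edu": 5, "example.org": 30}
report = filter_by_frequency(domain_histogram, threshold=10)
for dom, freq in report:
    print(dom, "->", freq)
```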

Peter
 
Amy G

Okay. You rock Peter!!! If I can just enlist your help a little more...

I now have two dictionaries, each with domains and frequencies. I want to
have the domains which appear in both to be deleted from the first one.

Is there an easy way to do this?

 
Peter Otten

Amy said:
I now have two dictionaries, each with domains and frequencies. I want to
have the domains which appear in both to be deleted from the first one.

Is there an easy way to do this?

...     first.pop(dom, None)
...
20
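
Only a fragment of the interactive session above survives. What follows is a minimal sketch of the idea it points at, assuming two hypothetical dictionaries named first and second with made-up counts: pop every key of second from first, passing None as the default so that keys missing from first do not raise KeyError. It is written for Python 3, though the loop itself is the same in Python 2.

```python
# Made-up sample data, for illustration only
first = {"nowhere.com": 20, "elsewhere.edu": 5, "example.org": 7}
second = {"nowhere.com": 3, "other.net": 11}

# Remove from first every domain that also appears in second.
for dom in second:
    first.pop(dom, None)  # default None: no KeyError if dom is absent

print(first)
```

dict.pop() returns the value it removed, and the interactive prompt echoes any non-None result, which probably explains the stray 20 in the fragment.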
Peter
 
Amy G

Not sure that I actually have two dictionaries. I am setting

domains_black = filter_domains2(domain_histogram, threshold)

I thought this would return a dictionary, but apparently it is a list.
How do I perform this operation on the dictionary?

I am going to have to think of some way to thank you for all of your help.
 
Peter Otten

Amy said:
Not sure that I actually have two dictionaries. I am setting

domains_black = filter_domains2(domain_histogram, threshold)

I thought this would return a dictionary, but apparently it is a list.
How do I perform this operation on the dictionary?

domains_black = dict(filter_domains2(domain_histogram, threshold))

filter_domains2() returns a list of (domain, frequency) tuples, and the
dict() constructor is quite happy with it.
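
A small sketch of that conversion, using made-up pairs in place of the real output of filter_domains2(). The name domains_white is a hypothetical second dictionary, added only to show that the pop() approach then works on the result. Written for Python 3.

```python
# Stand-in for the list of (domain, frequency) tuples from filter_domains2()
pairs = [("elsewhere.edu", 15), ("nowhere.com", 20)]

# dict() accepts any iterable of (key, value) pairs
domains_black = dict(pairs)

# Hypothetical second dictionary; drop its domains from domains_black
domains_white = {"nowhere.com": 4}
for dom in domains_white:
    domains_black.pop(dom, None)

print(domains_black)
```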

Seriously, consider learning the language. I like the tutorial a lot, but
there are other online resources that you might investigate.

Peter
 
