Learning Python via a little word frequency program

Discussion in 'Python' started by Andrew Savige, Jan 9, 2008.

  1. I'm learning Python by reading David Beazley's "Python Essential Reference"
    book and writing a few toy programs. To get a feel for hashes and sorting,
    I set myself this little problem today (not homework, BTW):

    Given a string containing a space-separated list of names:

    names = "freddy fred bill jock kevin andrew kevin kevin jock"

    produce a frequency table of names, sorted descending by frequency.
    then ascending by name. For the above data, the output should be:

    kevin : 3
    jock : 2
    andrew : 1
    bill : 1
    fred : 1
    freddy : 1

    Here's my first attempt:

    names = "freddy fred bill jock kevin andrew kevin kevin jock"
    freq = {}
    for name in names.split():
    freq[name] = 1 + freq.get(name, 0)
    deco = zip([-x for x in freq.values()], freq.keys())
    deco.sort()
    for v, k in deco:
    print "%-10s: %d" % (k, -v)

    I'm interested to learn how more experienced Python folks would solve
    this little problem. Though I've read about the DSU Python sorting idiom,
    I'm not sure I've strictly applied it above ... and the -x hack above to
    achieve a descending sort feels a bit odd to me, though I couldn't think
    of a better way to do it.

    I also have a few specific questions. Instead of:

    for name in names.split():
    freq[name] = 1 + freq.get(name, 0)

    I might try:

    for name in names.split():
    try:
    freq[name] += 1
    except KeyError:
    freq[name] = 1

    Which is preferred?

    Ditto for:

    deco = zip([-x for x in freq.values()], freq.keys())

    versus:

    deco = zip(map(operator.neg, freq.values()), freq.keys())

    Finally, I might replace:

    for v, k in deco:
    print "%-10s: %d" % (k, -v)

    with:

    print "\n".join("%-10s: %d" % (k, -v) for v, k in deco)

    Any feedback on good Python style, performance tips, good books
    to read, etc. is appreciated.

    Thanks,
    /-\


    Make the switch to the world's best email. Get the new Yahoo!7 Mail now. www.yahoo7.com.au/worldsbestemail
     
    Andrew Savige, Jan 9, 2008
    #1
    1. Advertising

  2. Andrew Savige

    Ant Guest

    > I'm interested to learn how more experienced Python folks would solve
    > this little problem.


    I think I'd do the following:

    from collections import defaultdict

    names = "freddy fred bill jock kevin andrew kevin kevin jock"
    freq = defaultdict(lambda: 0)

    for name in names.split():
    freq[name] += 1

    pairs = [(v, k) for k, v in freq.iteritems()]

    for v, k in reversed(sorted(pairs)):
    print "%-10s: %d" % (k, v)


    defaultdict makes the frequency accumulation neater.

    reversed(sorted(pairs)) avoids the little -v hack and makes it more
    obvious what you are doing. Of course this could also be achieved by
    doing pairs.sort() and pairs.reverse() before iterating over the pairs
    list.

    Cheers,

    --
    Ant.
     
    Ant, Jan 9, 2008
    #2
    1. Advertising

  3. Andrew Savige

    Peter Otten Guest

    Andrew Savige wrote:

    > I'm learning Python by reading David Beazley's "Python Essential
    > Reference" book and writing a few toy programs. To get a feel for hashes
    > and sorting, I set myself this little problem today (not homework, BTW):
    >
    > Given a string containing a space-separated list of names:
    >
    > names = "freddy fred bill jock kevin andrew kevin kevin jock"
    >
    > produce a frequency table of names, sorted descending by frequency.
    > then ascending by name. For the above data, the output should be:
    >
    > kevin : 3
    > jock : 2
    > andrew : 1
    > bill : 1
    > fred : 1
    > freddy : 1
    >
    > Here's my first attempt:
    >
    > names = "freddy fred bill jock kevin andrew kevin kevin jock" freq = {}
    > for name in names.split():
    > freq[name] = 1 + freq.get(name, 0)
    > deco = zip([-x for x in freq.values()], freq.keys()) deco.sort() for v,
    > k in deco:
    > print "%-10s: %d" % (k, -v)
    >
    > I'm interested to learn how more experienced Python folks would solve
    > this little problem. Though I've read about the DSU Python sorting
    > idiom, I'm not sure I've strictly applied it above ... and the -x hack
    > above to achieve a descending sort feels a bit odd to me, though I
    > couldn't think of a better way to do it.


    You can specify a reverse sort with

    deco.sort(reverse=True)

    Newer versions of Python have the whole idiom built in:

    >>> items = freq.items()
    >>> from operator import itemgetter
    >>> items.sort(key=itemgetter(1), reverse=True)
    >>> for item in items:

    .... print "%-10s: %d" % item
    ....
    kevin : 3
    jock : 2
    bill : 1
    andrew : 1
    fred : 1
    freddy : 1

    You can pass an arbitrary function as key. itemgetter(1) is equivalent to

    def key(item): return item[1]

    > I also have a few specific questions. Instead of:
    >
    > for name in names.split():
    > freq[name] = 1 + freq.get(name, 0)
    >
    > I might try:
    >
    > for name in names.split():
    > try:
    > freq[name] += 1
    > except KeyError:
    > freq[name] = 1
    >
    > Which is preferred?


    I have no strong opinion about that. Generally speaking try...except is
    faster when you have many hits, i. e. the except suite is rarely invoked.
    Starting with Python 2.5 you can alternatively use

    from collections import defaultdict
    freq = defaultdict(int)
    for name in names.split():
    freq[name] += 1

    > Ditto for:
    >
    > deco = zip([-x for x in freq.values()], freq.keys())
    >
    > versus:
    >
    > deco = zip(map(operator.neg, freq.values()), freq.keys())


    I think the list comprehension is slightly more readable.

    > Finally, I might replace:
    >
    > for v, k in deco:
    > print "%-10s: %d" % (k, -v)
    >
    > with:
    >
    > print "\n".join("%-10s: %d" % (k, -v) for v, k in deco)


    Again, I find the explicit for loop more readable, but sometimes use the
    genexp, too.

    Peter
     
    Peter Otten, Jan 9, 2008
    #3
  4. Andrew Savige a écrit :
    > I'm learning Python by reading David Beazley's "Python Essential Reference"
    > book and writing a few toy programs. To get a feel for hashes and sorting,
    > I set myself this little problem today (not homework, BTW):
    >
    > Given a string containing a space-separated list of names:
    >
    > names = "freddy fred bill jock kevin andrew kevin kevin jock"
    >
    > produce a frequency table of names, sorted descending by frequency.
    > then ascending by name. For the above data, the output should be:
    >
    > kevin : 3
    > jock : 2
    > andrew : 1
    > bill : 1
    > fred : 1
    > freddy : 1
    >
    > Here's my first attempt:
    >
    > names = "freddy fred bill jock kevin andrew kevin kevin jock"
    > freq = {}
    > for name in names.split():
    > freq[name] = 1 + freq.get(name, 0)
    > deco = zip([-x for x in freq.values()], freq.keys())
    > deco.sort()
    > for v, k in deco:
    > print "%-10s: %d" % (k, -v)
    >
    > I'm interested to learn how more experienced Python folks would solve
    > this little problem.


    For a one-shot Q&D script:

    names = "freddy fred bill jock kevin andrew kevin kevin jock"
    freqs = [(- names.count(name), name) for name in set(names.split())]
    print "\n".join("%-10s : %s" % (n, -f) for f, n in sorted(freqs))


    Now I might choose a very different solution for a more serious
    application, depending on detailed specs and intended use of the
    "frequency table".

    > Though I've read about the DSU Python sorting idiom,
    > I'm not sure I've strictly applied it above ...


    Perhaps not "strictly" since you don't really "undecorate", but that's
    another application of the same principle : provided the appropriate
    data structure, sort() (or sorted()) will do the right thing.


    > and the -x hack above to
    > achieve a descending sort feels a bit odd to me, though I couldn't think
    > of a better way to do it.


    The "other" way would be to pass a custom comparison callback to sort,
    which would be both slower and more complicated. Your solution is IMHO
    the right thing to do here.

    > I also have a few specific questions. Instead of:
    >
    > for name in names.split():
    > freq[name] = 1 + freq.get(name, 0)
    >
    > I might try:
    >
    > for name in names.split():
    > try:
    > freq[name] += 1
    > except KeyError:
    > freq[name] = 1


    or a couple other solutions, including a defaultdict (python >= 2.5).

    > Which is preferred?


    It's a FAQ - or it should be one. Globally: the second one tends to be
    faster when there's no exception (ie the key already exists), but slower
    when exceptions happen. So it mostly depends on what you expect your
    dataset to be.

    Now note that you don't necessarily need a dict here !-)

    > Ditto for:
    >
    > deco = zip([-x for x in freq.values()], freq.keys())
    >
    > versus:
    >
    > deco = zip(map(operator.neg, freq.values()), freq.keys())


    As far as I'm concerned, I'd favor the first solution here. Reads better
    IMHO

    > Finally, I might replace:
    >
    > for v, k in deco:
    > print "%-10s: %d" % (k, -v)
    >
    > with:
    >
    > print "\n".join("%-10s: %d" % (k, -v) for v, k in deco)


    That's what I'd do here too - but it depends on context (ie: for huge
    datasets and/or complex formating, i'd use a for loop).
     
    Bruno Desthuilliers, Jan 9, 2008
    #4
  5. Ant a écrit :
    >> I'm interested to learn how more experienced Python folks would solve
    >> this little problem.

    >
    > I think I'd do the following:
    >
    > from collections import defaultdict
    >
    > names = "freddy fred bill jock kevin andrew kevin kevin jock"
    > freq = defaultdict(lambda: 0)
    >
    > for name in names.split():
    > freq[name] += 1
    >
    > pairs = [(v, k) for k, v in freq.iteritems()]
    >
    > for v, k in reversed(sorted(pairs)):
    > print "%-10s: %d" % (k, v)
    >
    >
    > defaultdict makes the frequency accumulation neater.
    >
    > reversed(sorted(pairs)) avoids the little -v hack and makes it more
    > obvious what you are doing.


    But fails to implement the specs (emphasis is mine):
    """
    produce a frequency table of names, sorted descending by frequency.
    *then ascending by name*. For the above data, the output should be:

    kevin : 3
    jock : 2
    andrew : 1
    bill : 1
    fred : 1
    freddy : 1
    """

    With your solution, you get:

    kevin : 3
    jock : 2
    freddy : 1
    fred : 1
    bill : 1
    andrew : 1
     
    Bruno Desthuilliers, Jan 9, 2008
    #5
  6. Andrew Savige

    MRAB Guest

    On Jan 9, 12:19 pm, Bruno Desthuilliers <bruno.
    > wrote:
    > Andrew Savige a écrit :
    >
    >
    >
    > > I'm learning Python by reading David Beazley's "Python Essential Reference"
    > > book and writing a few toy programs. To get a feel for hashes and sorting,
    > > I set myself this little problem today (not homework, BTW):

    >
    > > Given a string containing a space-separated list of names:

    >
    > > names = "freddy fred bill jock kevin andrew kevin kevin jock"

    >
    > > produce a frequency table of names, sorted descending by frequency.
    > > then ascending by name. For the above data, the output should be:

    >
    > > kevin : 3
    > > jock : 2
    > > andrew : 1
    > > bill : 1
    > > fred : 1
    > > freddy : 1

    >
    > > Here's my first attempt:

    >
    > > names = "freddy fred bill jock kevin andrew kevin kevin jock"
    > > freq = {}
    > > for name in names.split():
    > > freq[name] = 1 + freq.get(name, 0)
    > > deco = zip([-x for x in freq.values()], freq.keys())
    > > deco.sort()
    > > for v, k in deco:
    > > print "%-10s: %d" % (k, -v)

    >
    > > I'm interested to learn how more experienced Python folks would solve
    > > this little problem.

    >
    > For a one-shot Q&D script:
    >
    > names = "freddy fred bill jock kevin andrew kevin kevin jock"
    > freqs = [(- names.count(name), name) for name in set(names.split())]
    > print "\n".join("%-10s : %s" % (n, -f) for f, n in sorted(freqs))
    >

    [snip]
    That actually prints:

    kevin : 3
    fred : 2
    jock : 2
    andrew : 1
    bill : 1
    freddy : 1

    It says that "fred" occurs twice because of "freddy".

    names = "freddy fred bill jock kevin andrew kevin kevin jock"
    name_list = names.split()
    freqs = [(- name_list.count(name), name) for name in set(name_list)]
    print "\n".join("%-10s : %s" % (n, -f) for f, n in sorted(freqs))
     
    MRAB, Jan 9, 2008
    #6
  7. MRAB a écrit :
    > On Jan 9, 12:19 pm, Bruno Desthuilliers <bruno.
    > > wrote:

    (snip)
    > That actually prints:
    >
    > kevin : 3
    > fred : 2
    > jock : 2
    > andrew : 1
    > bill : 1
    > freddy : 1
    >
    > It says that "fred" occurs twice because of "freddy".


    oops ! My bad, didn't spot that one :(

    Thanks for pointing this out.
     
    Bruno Desthuilliers, Jan 10, 2008
    #7
  8. Andrew Savige

    rent Guest

    import collections

    names = "freddy fred bill jock kevin andrew kevin kevin jock"
    freq = collections.defaultdict(int)
    for name in names.split():
    freq[name] += 1
    keys = freq.keys()
    keys.sort(key = freq.get, reverse = True)
    for k in keys:
    print "%-10s: %d" % (k, freq[k])

    On Jan 9, 6:58 pm, Andrew Savige <> wrote:
    > I'm learning Python by reading David Beazley's "Python Essential Reference"
    > book and writing a few toy programs. To get a feel for hashes and sorting,
    > I set myself this little problem today (not homework, BTW):
    >
    > Given a string containing a space-separated list of names:
    >
    > names = "freddy fred bill jock kevin andrew kevin kevin jock"
    >
    > produce a frequency table of names, sorted descending by frequency.
    > then ascending by name. For the above data, the output should be:
    >
    > kevin : 3
    > jock : 2
    > andrew : 1
    > bill : 1
    > fred : 1
    > freddy : 1
    >
    > Here's my first attempt:
    >
    > names = "freddy fred bill jock kevin andrew kevin kevin jock"
    > freq = {}
    > for name in names.split():
    > freq[name] = 1 + freq.get(name, 0)
    > deco = zip([-x for x in freq.values()], freq.keys())
    > deco.sort()
    > for v, k in deco:
    > print "%-10s: %d" % (k, -v)
    >
    > I'm interested to learn how more experienced Python folks would solve
    > this little problem. Though I've read about the DSU Python sorting idiom,
    > I'm not sure I've strictly applied it above ... and the -x hack above to
    > achieve a descending sort feels a bit odd to me, though I couldn't think
    > of a better way to do it.
    >
    > I also have a few specific questions. Instead of:
    >
    > for name in names.split():
    > freq[name] = 1 + freq.get(name, 0)
    >
    > I might try:
    >
    > for name in names.split():
    > try:
    > freq[name] += 1
    > except KeyError:
    > freq[name] = 1
    >
    > Which is preferred?
    >
    > Ditto for:
    >
    > deco = zip([-x for x in freq.values()], freq.keys())
    >
    > versus:
    >
    > deco = zip(map(operator.neg, freq.values()), freq.keys())
    >
    > Finally, I might replace:
    >
    > for v, k in deco:
    > print "%-10s: %d" % (k, -v)
    >
    > with:
    >
    > print "\n".join("%-10s: %d" % (k, -v) for v, k in deco)
    >
    > Any feedback on good Python style, performance tips, good books
    > to read, etc. is appreciated.
    >
    > Thanks,
    > /-\
    >
    > Make the switch to the world's best email. Get the new Yahoo!7 Mail now.www.yahoo7.com.au/worldsbestemail
     
    rent, Jan 11, 2008
    #8
  9. Andrew Savige

    Paul Rubin Guest

    rent <> writes:
    > keys = freq.keys()
    > keys.sort(key = freq.get, reverse = True)
    > for k in keys:
    > print "%-10s: %d" % (k, freq[k])


    I prefer (untested):

    def snd((x,y)): return y # I wish this was built-in
    sorted_freq = sorted(freq.iteritems(), key=snd, reverse=True)
    for k,f in sorted_freq:
    print "%-10s: %d" % (k, f)
     
    Paul Rubin, Jan 11, 2008
    #9
  10. Andrew Savige

    Mike Meyer Guest

    On 11 Jan 2008 03:50:53 -0800 Paul Rubin <"http://phr.cx"@NOSPAM.invalid> wrote:

    > rent <> writes:
    > > keys = freq.keys()
    > > keys.sort(key = freq.get, reverse = True)
    > > for k in keys:
    > > print "%-10s: %d" % (k, freq[k])

    >
    > I prefer (untested):
    >
    > def snd((x,y)): return y # I wish this was built-in


    What's wrong with operator.itemgetter?

    > sorted_freq = sorted(freq.iteritems(), key=snd, reverse=True)


    (still untested)

    from operator import itemgetter
    sorted_freq = sorted(freq.iteritems(), key=itemgetter(2), reverse=True)

    <mike

    --
    Mike Meyer <> http://www.mired.org/consulting.html
    Independent Network/Unix/Perforce consultant, email for more information.
     
    Mike Meyer, Jan 11, 2008
    #10
  11. Mike Meyer <> writes:

    > On 11 Jan 2008 03:50:53 -0800 Paul Rubin <"http://phr.cx"@NOSPAM.invalid> wrote:
    >
    >> rent <> writes:
    >> > keys = freq.keys()
    >> > keys.sort(key = freq.get, reverse = True)
    >> > for k in keys:
    >> > print "%-10s: %d" % (k, freq[k])

    >>
    >> I prefer (untested):
    >>
    >> def snd((x,y)): return y # I wish this was built-in

    >
    > What's wrong with operator.itemgetter?
    >
    >> sorted_freq = sorted(freq.iteritems(), key=snd, reverse=True)

    >
    > (still untested)
    >
    > from operator import itemgetter
    > sorted_freq = sorted(freq.iteritems(), key=itemgetter(2), reverse=True)


    It should be itemgetter(1). See how easy it is to get it wrong? :)


    (Okay, this was too easy a shot to miss out on; I actually like
    itemgetter.)
     
    Hrvoje Niksic, Jan 11, 2008
    #11
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Kevin
    Replies:
    16
    Views:
    8,251
    Dale King
    Apr 19, 2005
  2. Gordon Odell

    Looking for a word frequency counter.

    Gordon Odell, Feb 15, 2006, in forum: HTML
    Replies:
    2
    Views:
    487
    Stefan B Rusynko
    Feb 16, 2006
  3. x1
    Replies:
    9
    Views:
    319
    Rick DeNatale
    Oct 12, 2006
  4. PerlFAQ Server
    Replies:
    0
    Views:
    209
    PerlFAQ Server
    Feb 1, 2011
  5. PerlFAQ Server
    Replies:
    0
    Views:
    200
    PerlFAQ Server
    Mar 26, 2011
Loading...

Share This Page