Performance issue

T

Tom Carrick

Hi,

In my attempted learning of python, I've decided to recode an old
anagram solving program I made in C++. The C++ version runs in less
than a second, while the python takes 30 seconds. I'm not willing to
think it's just python being slow, so I was hoping someone could find
a faster way of doing this. Also, I was wondering if there was a more
builtin, or just nicer way of converting a string to a list (or using
the sort function on a list) than making a function for it.

The words.txt here is just a copy of FreeBSD's /usr/share/dict/words

Anyway, the code:

import string

# Need a function to convert a string to a list to be
# able to use the sort() function
def string2list(s):
l = []
for i in range(0, len(s)):
l.append(s)
return l

words = []
found = []

anagram = raw_input("Find anagrams of word: ")

f = open('words.txt', 'r')
file = f.read()
f.close()

words = file.splitlines()

sorted_anagram = anagram.lower()
sorted_anagram = string2list(anagram)
sorted_anagram.sort(lambda x, y: cmp(x, y))


while words:
if len(words[0]) == len(sorted_anagram):
wordlist = string2list(words[0])
wordlist.sort(lambda x, y: cmp(x, y))
sorted_wordlist = wordlist
if sorted_anagram == sorted_wordlist:
found.append(words[0])
del words[0]

print "Anagrams of " + anagram + ": "
while found:
print found[0] + " "
del found[0]
 
I

Irmen de Jong

Tom said:
Hi,

In my attempted learning of python, I've decided to recode an old
anagram solving program I made in C++. The C++ version runs in less
than a second, while the python takes 30 seconds. I'm not willing to
think it's just python being slow, so I was hoping someone could find
a faster way of doing this.

I like your attitude, not thinking that it's just Python that is slow :)
Also, I was wondering if there was a more
builtin, or just nicer way of converting a string to a list (or using
the sort function on a list) than making a function for it.

String to list: list("irmen") # --> ['i','r','m','e','n']
Sorted list: sorted("irmen") # --> ['e', 'i', 'm', 'n', 'r']
(the latter works in Python 2.4+)
The words.txt here is just a copy of FreeBSD's /usr/share/dict/words

Anyway, the code:

import string

# Need a function to convert a string to a list to be
# able to use the sort() function
def string2list(s):
l = []
for i in range(0, len(s)):
l.append(s)
return l


.... see above... just replace string2list(s) with sorted(s)
words = []
found = []

anagram = raw_input("Find anagrams of word: ")

f = open('words.txt', 'r')
file = f.read()
f.close()

Style: don't use 'file' as a variable name, you're hiding the
builtin 'file' function
words = file.splitlines()

You can obtain this list without reading the file in its entirety,
by using the readlines method of file objects:

words=open("words.txt").readlines()
sorted_anagram = anagram.lower()
sorted_anagram = string2list(anagram)
sorted_anagram.sort(lambda x, y: cmp(x, y))

The lambda is optional and only slows it down :)
But to get a sorted list of letters, just use sorted(s)
if you're on Python 2.4+
while words:
if len(words[0]) == len(sorted_anagram):
wordlist = string2list(words[0])
wordlist.sort(lambda x, y: cmp(x, y))
sorted_wordlist = wordlist

(same here.. replacing this by sorted(words[0]) probably
will speed it up rather significantly, partly because
it avoids the creation of those temporary lists)
if sorted_anagram == sorted_wordlist:
found.append(words[0])
del words[0]

print "Anagrams of " + anagram + ": "
while found:
print found[0] + " "
del found[0]

print " ".join(found)


Cheers

--Irmen de Jong
 
M

Marc 'BlackJack' Rintsch

Tom Carrick said:
[…] Also, I was wondering if there was a more
builtin, or just nicer way of converting a string to a list (or using
the sort function on a list) than making a function for it.

Use the `list()` builtin on the string and *just* the `sort()` method::

In [2]: characters = list('hello')

In [3]: characters
Out[3]: ['h', 'e', 'l', 'l', 'o']

In [4]: characters.sort()

In [5]: characters
Out[5]: ['e', 'h', 'l', 'l', 'o']
sorted_anagram = anagram.lower()
sorted_anagram = string2list(anagram)
sorted_anagram.sort(lambda x, y: cmp(x, y))

sorted_anagram = list(anagram.lower())
sorted_anagram.sort()
while words:
if len(words[0]) == len(sorted_anagram):
wordlist = string2list(words[0])
wordlist.sort(lambda x, y: cmp(x, y))
sorted_wordlist = wordlist
if sorted_anagram == sorted_wordlist:
found.append(words[0])
del words[0]

And here's the performance issue. Deleting the first element of a list
results in moving all remaining elements one index down. Better iterate
over the words in a for loop::

for word in words:
# use `word` instead of `word[0]` in the loop body.
...

Ciao,
Marc 'BlackJack' Rintsch
 
M

Marc 'BlackJack' Rintsch

You can obtain this list without reading the file in its entirety,
by using the readlines method of file objects:

words=open("words.txt").readlines()

This leaves the newline characters at the end of each line while
`str.splitlines()` removes them.

Ciao,
Marc 'BlackJack' Rintsch
 
V

vincent wehren

| Hi,
|
| In my attempted learning of python, I've decided to recode an old
| anagram solving program I made in C++. The C++ version runs in less
| than a second, while the python takes 30 seconds. I'm not willing to
| think it's just python being slow, so I was hoping someone could find
| a faster way of doing this. Also, I was wondering if there was a more
| builtin, or just nicer way of converting a string to a list (or using
| the sort function on a list) than making a function for it.
|
| The words.txt here is just a copy of FreeBSD's /usr/share/dict/words
|
| Anyway, the code:
|
| import string

You're importing string, but never use it, so you can omit that line.

|
| # Need a function to convert a string to a list to be
| # able to use the sort() function
| def string2list(s):
| l = []
| for i in range(0, len(s)):
| l.append(s)
| return l

No need to write your own function. list(s) already does the trick.

|
| words = []
| found = []
|
| anagram = raw_input("Find anagrams of word: ")
|
| f = open('words.txt', 'r')
| file = f.read()
| f.close()


I don't have a copy of words.txt, but here's what I would try
(untested):

anagram = raw_input("Find anagrams of word: ")
sorted_anagram = list(sorted(anagram.lower()))
# If you're Python is pre 2.4 ise
# sorted_anagram = list(anagram.lower())
# sorted_anagram.sort() #--sort list in place


found = []
# assuming "words.txt" contains a word per line
# iterate over the lines of the file

for line in open("/path/to/words.txt"):
word = line[:-1] # Get rid of trailing newline
sorted_word = list(sorted(word.lower()))
if sorted_word == sorted_anagram:
found.append(word)
if found:
print "Anagrams of %s:" % anagram
for w in found:
print w
else:
print "No anagrams for %s" % anagram
 
T

Thomas Rast

Tom Carrick said:
In my attempted learning of python, I've decided to recode an old
anagram solving program I made in C++. The C++ version runs in less
than a second, while the python takes 30 seconds.

Indeed, your program can be improved to run about ten times as fast,
which (on my system, with 96274 entries in /usr/share/dict/words) is
below a second.

In general you should try to move the loops into C code, i.e. use
built-in functions instead of long 'for' blocks.

Some comments on your version:
import string

# Need a function to convert a string to a list to be
# able to use the sort() function
def string2list(s): [snipped]

list() achieves the same thing a lot faster.
words = []

You do not need to initialize 'words' here, as you're overwriting it a
few lines afterwards.
found = []

anagram = raw_input("Find anagrams of word: ")

f = open('words.txt', 'r')
file = f.read()
f.close()

words = file.splitlines()

Try to avoid assigning to the names of built-in functions if you can.
Names like 'file', 'list', 'dict', 'map' etc. are often an obvious
choice, but overwriting them means that you don't "just" know what a
later use refers to.
sorted_anagram = anagram.lower()
sorted_anagram = string2list(anagram)
sorted_anagram.sort(lambda x, y: cmp(x, y))

Unless you *really* have to, don't use comparison functions with
sort(), as they slow the operation considerably. In this (as in most)
cases, a plain sorted_anagram.sort() does the trick, and in version
2.4 you can achieve custom sort orders with the optional 'key'
argument. The sorted() built-in also comes in handy here.
while words:
if len(words[0]) == len(sorted_anagram):
wordlist = string2list(words[0])
wordlist.sort(lambda x, y: cmp(x, y))
sorted_wordlist = wordlist
if sorted_anagram == sorted_wordlist:
found.append(words[0])
del words[0]

Avoid this style of looping at all times! Removing the first element
of a list is O(n), so looping through the whole list as above is
O(n**2). In most cases you should use a for loop:

for word in words:
# do something

which is O(n) of course. If you do have to loop destructively, pop()
from the end (which is the default) like so:

while words:
word = words.pop()
# do something

This is also O(n), because removing the *last* element of a list is
O(1) (amortized; I suppose the implementation will occasionally shrink
the underlying array at linear cost).
print "Anagrams of " + anagram + ": "
while found:
print found[0] + " "
del found[0]

I assume you meant not to print a newline between the words, which
'print' does by default. The best solution in that case is "
".join(found).

A better version (2.4+ only):

-- 8< -- 8< --
anagram = raw_input("Find anagrams of word: ")

words = open('words.txt', 'r')

sorted_anagram = sorted(anagram.lower())

found = []

for word in words.read().splitlines():
if len(word) == len(anagram) and sorted(word) == sorted_anagram:
found.append(word)

print "Anagrams of %s: %s" % (anagram, ' '.join(found))
-- >8 -- >8 --

Interestingly, the length comparison makes quite a difference! I
removed it at first, thinking it was unnecessary. Here are some
timings:

* Your original version (for comparison):

$ time echo stop | python2.4 anagram_slow.py
[...]
real 0m9.090s
user 0m8.790s
sys 0m0.013s

* Your version, but with the O(n**2) loop replaced by an O(n) 'for':

$ time echo stop | python2.4 anagram_forloop.py
[...]
real 0m0.221s
user 0m0.134s
sys 0m0.014s

* My version but with the length comparison removed:

$ time echo stop | python2.4 anagram_no_lencmp.py
[...]
real 0m0.408s
user 0m0.353s
sys 0m0.010s

* My version as above:

$ time echo stop | python2.4 anagram_fast.py
[...]
real 0m0.144s
user 0m0.099s
sys 0m0.008s

Hope that helps :)

- Thomas
 
S

stelios xanthakis

Scott said:
if __name__ == '__main__':
import sys
main(sys.argv[1:] or ['anagrams.py'])

This is *exactly* the kind of testcases I'm looking for to test
the soon-to-be-released pyvm. Great! I'll be back with results.


For now, a fast anagrams.py is
--------------------------------------
import sys

WORDS = [ i.rstrip () for i in file ('/usr/share/dict/words') ]

def findana (anagram):
sorted_anagram = sorted(anagram.lower())
len_anagram = len (anagram)
found = [ word for word in WORDS if len(word)==len_anagram and
sorted(word)==sorted_anagram ]
print "Anagrams of %s: %s" % (anagram, ' '.join(found))

for i in sys.argv [1:]:
findana (i)
-----------------------------------------


And timings....

time python anagram.pyc stop step words lots pool eat fast slow lamp
cold door xyzzy
Anagrams of stop: opts post pots spot stop tops
Anagrams of step: pest pets sept step
Anagrams of words: sword words
Anagrams of lots: lost lots slot
Anagrams of pool: loop polo pool
Anagrams of eat: ate eat tea
Anagrams of fast: fast fats
Anagrams of slow: lows owls slow
Anagrams of lamp: lamp palm
Anagrams of cold: clod cold
Anagrams of door: door odor
Anagrams of xyzzy:

real 0m1.491s
user 0m1.390s
sys 0m0.040s

time pyvm anagram.pyc stop step words lots pool eat fast slow lamp cold
door xyzzy
Anagrams of stop: opts post pots spot stop tops
Anagrams of step: pest pets sept step
Anagrams of words: sword words
Anagrams of lots: lost lots slot
Anagrams of pool: loop polo pool
Anagrams of eat: ate eat tea
Anagrams of fast: fast fats
Anagrams of slow: lows owls slow
Anagrams of lamp: lamp palm
Anagrams of cold: clod cold
Anagrams of door: door odor
Anagrams of xyzzy:

real 0m0.923s
user 0m0.760s
sys 0m0.070s
 
S

Scott David Daniels

Thomas said:
Indeed, your program can be improved to run about ten times as fast, ...
<great stuff>

This problem inspired an "all anagrams" program. Using it I was able
to find the largest anagram group in Shakespeare's first folio in about
the time you originally found anagrams for an individual word.

7: owers = rowse = sower = sowre = swore = woers = worse

====

def words(source):
for line in source:
for word in line.split():
yield word


def all_anagrams(words):
seen = dict()

for word in words:
word = word.lower()
if word not in seen:
dorw = ''.join(sorted(word))
try:
seen[dorw].append(word)
except KeyError:
seen[dorw] = [word]
if word == dorw:
continue
seen[word] = ()
for group in seen.itervalues():
if len(group) > 1:
yield -len(group), sorted(group) # conveniently sortable


def main(sources):
for filename in sources:
dictionary = open(filename, 'r')
print "All anagrams from %s:" % filename
try:
for nsize, group in sorted(all_anagrams(words(dictionary))):
print '%2s: %s' % (-nsize, ' = '.join(group))
finally:
dictionary.close()
print



if __name__ == '__main__':
import sys
main(sys.argv[1:] or ['anagrams.py'])
 
M

Marc 'BlackJack' Rintsch

Tom Carrick said:
In my attempted learning of python, I've decided to recode an old
anagram solving program I made in C++. The C++ version runs in less
than a second, while the python takes 30 seconds. I'm not willing to
think it's just python being slow, so I was hoping someone could find
a faster way of doing this. Also, I was wondering if there was a more
builtin, or just nicer way of converting a string to a list (or using
the sort function on a list) than making a function for it.

The words.txt here is just a copy of FreeBSD's /usr/share/dict/words

Here's my attempt which builds an anagram dictionary ("sorted word" ->
list of anagrams) for fast lookup of anagrams::

#!/usr/bin/env python2.4
from itertools import imap, ifilter

WORDS = '/usr/share/dict/words'

def make_anagram_map(words):
anagram_map = dict()
for word in imap(lambda w: w.strip().lower(), words):
sorted_word = ''.join(sorted(list(word)))
anagram_map.setdefault(sorted_word, list()).append(word)

return dict(ifilter(lambda x: len(x[1]) > 1, anagram_map.iteritems()))


def main():
words_file = open(WORDS, 'r')
anagram_map = make_anagram_map(words_file)
words_file.close()

while True:
word = raw_input('Find anagrams of word (just enter to end): ')
if not word:
break
try:
print anagram_map[''.join(sorted(list(word.strip().lower())))]
except KeyError:
print 'No anagrams found for %r' % word

# # Print all anagrams sorted by number of anagrams.
# print '\n'.join(map(str, sorted(anagram_map.values(), key=len)))
# print len(anagram_map)


if __name__ == '__main__':
main()


Ciao,
Marc 'BlackJack' Rintsch
 
S

Steven Bethard

Marc said:
def make_anagram_map(words):
anagram_map = dict()
for word in imap(lambda w: w.strip().lower(), words):
sorted_word = ''.join(sorted(list(word)))
anagram_map.setdefault(sorted_word, list()).append(word)

return dict(ifilter(lambda x: len(x[1]) > 1, anagram_map.iteritems()))

Or if you're afraid of map and filter like me, you can try:

def make_anagram_map(words):
anagram_map = {}
for word in (w.strip().lower() for w in words):
anagram_map.setdefault(''.join(sorted(word)), []).append(word)
return dict(sortedword_wordlist
for sortedword_wordlist in anagram_map.iteritems()
if len(sortedword_wordlist[1]) > 1)


py> make_anagram_map(['owers', 'pest', 'rowse', 'pets', 'sower', 'step'])
{'epst': ['pest', 'pets', 'step'], 'eorsw': ['owers', 'rowse', 'sower']}

STeVe
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,902
Latest member
Elena68X5

Latest Threads

Top