Suggest more finesse, please. I/O and sequences.

Q

Qertoip

Would you like to suggest me any improvements for the following code?
I want to make my implementation as simple, as Python - native, as fine as
possible.

I've written simple code, which reads input text file and creates words'
ranking by number of appearence.

Code:
---------------------------------------------------------------------------
import sys

def moreCommonWord( x, y ):
if x[1] != y[1]:
return cmp( x[1], y[1] ) * -1
return cmp( x[0], y[0] )

wordsDic = {}
inFile = open( sys.argv[1] )
for word in inFile.read().split():
if wordsDic.has_key( word ):
wordsDic[word] = wordsDic[word] + 1
else:
wordsDic[word] = 1
inFile.close()

wordsLst = wordsDic.items()
wordsLst.sort( moreCommonWord )

outFile = open( sys.argv[2], 'w')
for pair in wordsLst:
outFile.write( str( pair[1] ).rjust( 7 ) + " : " + str( pair[0] ) + "\n" )
outFile.close()
---------------------------------------------------------------------------

In particular, I don't like reading whole file just to split it.
It is easy to read by lines - may I read by words with that ease?

PS I've been learning Python since todays morning, so be understanding :>
 
S

Scott David Daniels

Qertoip said:
import sys

def moreCommonWord( x, y ):
if x[1] != y[1]:
return cmp( x[1], y[1] ) * -1
return cmp( x[0], y[0] )

If you want to keep this, use:

def moreCommonWord(x, y):
if x[1] != y[1]:
return cmp(y[1], x[1])
return cmp(x[0], y[0])

....
I don't like type-based names (Charles Simonyi never convinced me), so:
> wordsDic = {}
corpus = {}

....
for word in inFile.read().split():
if wordsDic.has_key( word ):
wordsDic[word] = wordsDic[word] + 1
else:
wordsDic[word] = 1
inFile.close()

How about:
for line in inFile:
for word in line.split():
try:
corpus[word] += 1
except KeyError:
corpus[word] = 1

....
wordsLst = wordsDic.items()
wordsLst.sort( moreCommonWord )
OK, here I'm going to get version specific.
For Python 2.4 and later:
words = sorted((-freq, word) for word, freq
in corpus.iteritems())
For at least Python 2.2:
words = [(-freq, word) for word, freq in corpus.iteritems()]
words.sort()
For before Python 2.2:
words = corpus.items()
words.sort(moreCommonWord)
for pair in wordsLst:
outFile.write( str( pair[1] ).rjust( 7 ) + " : " + str( pair[0] ) + "\n" )
outFile.close()

Before python 2.2 (because we use different data for words):
for word, frequency in words:
print >>outFile, '%7d : %s' % (frequency, word)

After python 2.2:
for negfrequency, word in words:
print >>outFile, '%7d : %s' % (-negfrequency, word)


So, with all my prejudices in place and python 2.4 on my box, I'd
lift a few things to functions:


def refcount(corpus, infile):
'''Update corpus counters in corpus from words in infile'''
for line in infile:
for word in line.split():
try:
corpus[word] += 1
except KeyError:
corpus[word] = 1


def main(sources, output=None):
'''Count words in sources and report frequencies to output'''
corpus = {}
for source in sources:
f = open(source)
refcount(corpus, f)
f.close()
for negfrequency, word in sorted((-frequency, word) for
word, frequency in corpus.iteritems()):
print >>output, '%7d : %s' % (-negfrequency, word)


if __name__ == '__main__':
import sys

if len(sys.argv) < 2:
main(sys.argv[1 :])
else:
output = open(sys.argv[-1], 'w')
try:
main(sys.argv[1 : -1], output)
finally:
output.close()


--Scott David Daniels
(e-mail address removed)
 
Q

Qertoip

Dnia Fri, 25 Mar 2005 12:51:59 -0800, Scott David Daniels napisa³(a):

Thanks for your reply! It was really enlightening.

How about:
for line in inFile:
for word in line.split():
try:
corpus[word] += 1
except KeyError:
corpus[word] = 1

Above is (probably) not efficient when exception is thrown, that is most of
the time (for any new word). However, I've just read about the following:
corpus[word] = corpus.setdefault( word, 0 ) + 1

OK, here I'm going to get version specific.
For Python 2.4 and later:
words = sorted((-freq, word) for word, freq in corpus.iteritems())

This is my favorite! :) You managed to avoid moreCommonWord() through the
clever use of list comprehensions and sequences comaparison rules.

After python 2.2:
for negfrequency, word in words:
print >>outFile, '%7d : %s' % (-negfrequency, word)

This is also cool, I didn't know about this kind of 'print' usage.

So, with all my prejudices in place and python 2.4 on my box, I'd
lift a few things to functions:

While I like your functionality and reusability improvements, I will stick
to my as-simple-as-possible solution for given requirements (which I didn't
mention, and which assume correct command line arguments for example).

Therefore, the current code is:
-------------------------------------------------------------------------
import sys

corpus = {}
inFile = open( sys.argv[1] )
for line in inFile:
for word in line.split():
corpus[word] = corpus.setdefault( word, 0 ) + 1
inFile.close()

words = sorted( ( -freq, word ) for word, freq in corpus.iteritems() )

outFile = open( sys.argv[2], 'w')
for negFreq, word in words:
print >>outFile, '%7d : %s' % ( -negFreq, word )
outFile.close()
 
L

Larry Bates

You might take advantage of the .get method on
dictionaries to rewrite:

wordsDic = {}
inFile = open( sys.argv[1] )
for word in inFile.read().split():
if wordsDic.has_key( word ):
wordsDic[word] = wordsDic[word] + 1
else:
wordsDic[word] = 1

as:

wordsDic = {}
inFile = open( sys.argv[1] )
for word in inFile.read().split():
wordsDict[word]=wordsDict.get(word, 0)+1


and taking advantage of tuple expansion and % formatting

for pair in wordsLst:
outFile.write( str( pair[1] ).rjust( 7 ) + " : " + str( pair[0] ) + "\n")

as

for word, count in wordsLst:
outFile.write("%7s : %i\n" % (word, count))

I guess you assumed all your words were less than 7 characters long (which
I copied).

But there are many other "good" ways I'm sure.

Larry Bates
 
S

Scott David Daniels

Qertoip said:
Dnia Fri, 25 Mar 2005 12:51:59 -0800, Scott David Daniels napisa³(a):
...
for word in line.split():
try:
corpus[word] += 1
except KeyError:
corpus[word] = 1

Above is (probably) not efficient when exception is thrown, that is most of
the time (for any new word). However, I've just read about the following:
corpus[word] = corpus.setdefault( word, 0 ) + 1

That is better for things like:
corpus.setdefault(word, []).append(...)

You might prefer:

corpus[word] = corpus.get(word, 0) + 1

The trade-off depends on the size of your test material. You need
to time it with your mix of words. I was thinking of cranking
through a huge body of text (so words of frequency 1 are by far
the minority case). If you run through Shakespeare's first folio,
and just do the counting part, the try-except and .get cases are
indistinguishable (2.0 sec for each), and the .setdefault version
drags in at a slow 2.2 sec. Just going through Anna Karenina,
again .83, .83 and .91. So the .setdefault form is 10% slower.
For great test cases, (and for your own personal edification)
visit Project Gutenberg.

Beware when you do timing: whether the file is "warm" or not can
make a huge difference. Read through it once before timing either.


--Scott David Daniels
(e-mail address removed)
 
Q

Qertoip

Dnia Fri, 25 Mar 2005 19:17:30 +0100, Qertoip napisa³(a):
Would you like to suggest me any improvements for the following code?
I want to make my implementation as simple, as Python - native, as fine as
possible.
I've written simple code, which reads input text file and creates words'
ranking by number of appearence.

Good friend of mine heard about my attempt to create compact, simple Python
script and authored the following PHP script:

--------------------------------------------------------------------------
$data=join(' ', file($argv[1]));
foreach(explode(' ', $data) as $slowo)
$stat[chop($slowo)]++;

array_multisort($stat, SORT_DESC, array_keys($stat));

foreach($stat as $sl=>$il)
$odata.="$il : $sl\n";

file_put_contents($argv[2], $odata);
--------------------------------------------------------------------------

....which has the same functionality with less actual code lines [7].
I'm a little bit confused, since I considered Python more expressive then
PHP. The more I'm interested in improving my implementation now :)
It looks like this [11 actual code lines]:

--------------------------------------------------------------------------
import sys

corpus = {}
inFile = open( sys.argv[1] )
for word in inFile.read().split():
corpus[word] = corpus.get( word, 0 ) + 1
inFile.close()

words = sorted( ( -freq, word ) for word, freq in corpus.iteritems() )

outFile = open( sys.argv[2], 'w')
for negFreq, word in words:
outFile.write( '%7d : %s\n' % ( -negFreq, word ) )
outFile.close()
 
P

Peter Hansen

Qertoip said:
Good friend of mine heard about my attempt to create compact, simple Python
script and authored the following PHP script: [snip 7-line PHP script]
...which has the same functionality with less actual code lines [7].
I'm a little bit confused, since I considered Python more expressive then
PHP. The more I'm interested in improving my implementation now :)
It looks like this [11 actual code lines]:
[snip 11-line Python]

If you want fewer lines, it's not too hard. Note that
if anyone has a problem with the style of this, I
point merely to the unreadable line noise that was
the PHP script as evidence that this is not about
style but conciseness:
--------------------------
import sys

corpus = {}
for word in open(sys.argv[1]).read().split():
corpus[word] = corpus.get( word, 0 ) + 1

words = reversed(sorted(data[::-1] for data in corpus.iteritems()))

open(sys.argv[2], 'w').writelines('%7d : %s\n' % data for data in words)
--------------------------

I'm curious if either this or the PHP does what is really
wanted, however. The above doesn't split on "words", but
merely on whitespace, making the results fairly meaningless
if you are concerned about punctuation etc.

-Peter
 
Q

Qertoip

Dnia Fri, 25 Mar 2005 21:09:41 -0500, Peter Hansen napisa³(a):

Thanks for comments! :)
Qertoip said:
Good friend of mine heard about my attempt to create compact, simple Python
script and authored the following PHP script: [snip 7-line PHP script]
...which has the same functionality with less actual code lines [7].
I'm a little bit confused, since I considered Python more expressive then
PHP. The more I'm interested in improving my implementation now :)
It looks like this [11 actual code lines]:
[snip 11-line Python]
--------------------------
import sys

corpus = {}
for word in open(sys.argv[1]).read().split():
corpus[word] = corpus.get( word, 0 ) + 1

words = reversed(sorted(data[::-1] for data in corpus.iteritems()))

open(sys.argv[2], 'w').writelines('%7d : %s\n' % data for data in words)
--------------------------

Is the file automatically closed in both cases?

I'm curious if either this or the PHP does what is really
wanted, however. The above doesn't split on "words", but
merely on whitespace, making the results fairly meaningless
if you are concerned about punctuation etc.

You are perfectly right, but the requirements were intentionally
simplified, ensuring 0-errors input and allowing words like 'hey,!1_go'

I aim to make my implementation as concise as possible but
*steel being natural, readable and clear*.
 
P

Peter Hansen

Qertoip said:
Peter said:
--------------------------
import sys

corpus = {}
for word in open(sys.argv[1]).read().split():
corpus[word] = corpus.get( word, 0 ) + 1

words = reversed(sorted(data[::-1] for data in corpus.iteritems()))

open(sys.argv[2], 'w').writelines('%7d : %s\n' % data for data in words)
--------------------------

Is the file automatically closed in both cases?

Yes. Either as soon as the file object has no more
references to it (with CPython), or as soon as the
application terminates or the garbage collection
executes to reclaim the objects (Jython, IronPython,
and any other Pythons that don't use reference
counting).

Either way, for a small script, it's safe.

-Peter
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top