Nickolay Kolev
Hi all,
I am currently writing some simple functions in the process of learning
Python. I have a task where the program has to read in a text file and
display some statistics about the tokens in that file.
The text I have been feeding it is Dickens' David Copperfield.
It is really simple - it reads the file into memory, splits it on
whitespace, strips punctuation characters and transforms all remaining
elements to lowercase. It then looks through what has been left and
creates a list of tuples (count, word) which contain each unique word
and the number of times it appears in the text.
The code (~30 lines and easy to read) can be found at
http://www.uni-bonn.de/~nmkolev/python/textStats.py
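In case the link is not reachable, here is a rough sketch of the approach described above. It is not the actual textStats.py: the function name word_counts is made up for illustration, and it assumes a dictionary is used for counting, which may differ from what the linked script does.

```python
import string


def word_counts(text):
    """Return a list of (count, word) tuples, most frequent first."""
    counts = {}
    # Split on whitespace, then clean up each token in turn.
    for token in text.split():
        # Strip leading/trailing punctuation and normalise to lowercase.
        word = token.strip(string.punctuation).lower()
        if word:  # skip tokens that were punctuation only
            counts[word] = counts.get(word, 0) + 1
    # Build (count, word) tuples so that sorting orders by count first.
    pairs = [(count, word) for word, count in counts.items()]
    pairs.sort(reverse=True)
    return pairs
```

Calling word_counts(open("copperfield.txt").read()) would give the statistics for the whole book, where copperfield.txt is just a placeholder file name.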
I am now looking for a way to make the whole thing run faster. I have
already made many changes since the initial version, realising many
mistakes. As I cannot think of anything else, I thought I would ask the
more knowledgeable.
I find the two loops through the initial list a bit troubling. Could
this be avoided?
Any other remarks and suggestions will also be greatly appreciated.
Many thanks in advance,
Nicky