Homegrown wordcount

Thomas Philips

I've coded a little word counting routine that handles a reasonably
wide range of inputs. How could it be made to cover more, though
admittedly more remote, possibilities such as nested lists of lists,
items for which the string representation is a string containing lists
etc. etc. without significantly increasing the complexity of the
program?

Thomas Philips

def wordcount(input):

    from string import whitespace

    # Treat iterable inputs differently
    if "__iter__" in dir(input):
        wordList = (" ".join([str(item) for item in input])).split()
    else:
        wordList = [str(input)]

    # Remove any words that are just whitespace
    for i, word in enumerate(wordList):
        while word and word[-1] in whitespace:
            word = word[:-1]
        wordList[i] = word
    wc = len(filter(None, wordList))  # Filter out any empty strings
    return wc
 
David Wilson

I've coded a little word counting routine that handles a reasonably
wide range of inputs. How could it be made to cover more, though
admittedly more remote, possibilities such as nested lists of lists,
items for which the string representation is a string containing lists
etc. etc. without significantly increasing the complexity of the
program?

Hello,

Such 'magical' behaviour is error prone and causes many a headache when
debugging. Some might think that even this is too much:
    # Treat iterable inputs differently
    if "__iter__" in dir(input):
        wordList = (" ".join([str(item) for item in input])).split()
    else:
        wordList = [str(input)]

Myself included. Instead of increasing the complexity of this
function, why not write a few wrapper functions if you need them?
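
A minimal sketch of the wrapper idea, assuming Python 3. The names
count_words and count_words_in_items are hypothetical and do not come
from the thread; each function accepts exactly one kind of input, so
no type sniffing is needed.

```python
def count_words(text):
    """Count whitespace-separated words in a single string."""
    return len(text.split())


def count_words_in_items(items):
    """Count words across a flat iterable of strings."""
    return sum(count_words(item) for item in items)


print(count_words("This is a test"))                        # 4
print(count_words_in_items(["This is a test", "one two"]))  # 6
```

The caller picks the wrapper that matches the data, which keeps a
wrongly typed argument from being silently counted in some surprising
way.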


David.
 
Larry Bates

Something like this?

def wordcount(input, sep=" "):
    global words
    if isinstance(input, str):
        words+=len([x.strip() for x in input.split(sep)])
        return words
    else:
        for item in input:
            wordcount(item)

    return words

#
# Test with a string
#
words=0
print wordcount("This is a test") # String test
words=0
print wordcount(["This is a test", "This is a test"]) # List test
words=0
print wordcount([["This is a test","This is a test"],
                 ["This is a test","This is a test"]]) # List of lists
words=0
data=[["this is a test"],["this", "is", "a", "test"],"This is a test"]
print wordcount(data)

HTH,
Larry Bates
 
Thomas Philips

An embarrassing mistake on my part: I should have typed
    # Treat iterable inputs differently
    if "__iter__" in dir(input):
        wordList = (" ".join([str(item) for item in input])).split()
    else:
        wordList = str(input).split()

I wish I knew how to treat all possible inputs in a uniform fashion,
but I'm nowhere near there as yet, hence the question. That said, it
addresses the situations that arise in practice fairly well, though I
am sure it can be sped up substantially.
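
One caveat with the "__iter__" test: it only separates strings from
lists on Python 2, where str has no __iter__ attribute. On Python 3
strings define it, so a plain string would be joined character by
character and miscounted. The sketch below assumes Python 3 and checks
for str first; the name wordcount_flat is hypothetical.

```python
from collections.abc import Iterable


def wordcount_flat(value):
    """Count words in a string, a flat iterable of items, or a single object."""
    if isinstance(value, str):
        # Check strings first: in Python 3 they are iterable too.
        words = value.split()
    elif isinstance(value, Iterable):
        # Join the string form of every item, then split on whitespace.
        words = " ".join(str(item) for item in value).split()
    else:
        # Fall back to the string representation of a single object.
        words = str(value).split()
    return len(words)


print(wordcount_flat("This is a test"))           # 4
print(wordcount_flat(["This is", "a test", 42]))  # 5
```

Because split() with no arguments already drops empty and
whitespace-only pieces, no separate trimming loop is needed in this
version.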

Thomas Philips
 
Grégoire Dooms

Larry said:
Something like this?

def wordcount(input, sep=" "):
    global words
    if isinstance(input, str):
        words+=len([x.strip() for x in input.split(sep)])

What's the purpose of stripping the items in the list if you just count
them? Isn't this equivalent to

        words += len(input.split(sep))
        return words
    else:
        for item in input:
            wordcount(item)

    return words

Removing the global statement and sep param, you get:

def wordcount(input):
    if isinstance(input, str):
        return len(input.split())
    else:
        return sum([wordcount(item) for item in input])
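
The recursive version above raises a TypeError as soon as it meets an
item that is neither a string nor iterable, such as a number. A
possible extension, sketched here for Python 3, falls back to the
item's string representation. The name wordcount_nested and the
fallback rule are assumptions, not something proposed in the thread.

```python
from collections.abc import Iterable


def wordcount_nested(value):
    """Recursively count words in strings nested inside iterables."""
    if isinstance(value, str):
        return len(value.split())
    if isinstance(value, Iterable):
        # Recurse into lists, tuples, generators and other iterables.
        return sum(wordcount_nested(item) for item in value)
    # Assumption: anything else is counted via its string representation.
    return len(str(value).split())


data = [["this is a test"], ["this", "is", "a", "test"], "This is a test", 42]
print(wordcount_nested(data))  # 13
```

Note that iterating over a dict yields only its keys, so mappings would
need their own rule if the values should be counted as well.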
 
Keith P. Boruff

Grégoire Dooms wrote:

What's the purpose of stripping the items in the list if you just count
them? Isn't this equivalent to

        words += len(input.split(sep))
        return words
    else:
        for item in input:
            wordcount(item)

    return words


Removing the global statement and sep param, you get:

def wordcount(input):
    if isinstance(input, str):
        return len(input.split())
    else:
        return sum([wordcount(item) for item in input])

After reading this thread, I decided to embark on a word counting
program of my own. One thing I like to do when learning new programming
languages is to try and emulate some of my favorite UNIX type programs.

That said, to get the count of words in a string, I merely did the
following:


# Beginning of program

import re
import sys

# Right now my simple wc program just reads piped data
if not sys.stdin.isatty():
    input_data = sys.stdin.read()
    # A word is any maximal run of non-whitespace characters
    print "number of words:", len(re.findall(r'[^\s]+', input_data))

# End of program

Though I've only done trivial tests on this up to now, the word count of
this script seems to match that of the wc on my system (RH Linux WS). I
ran some big RFC text files through this too.

There could be some flaws here; I don't know. I'll have to look at it
better when I get back from the gym. If anyone here finds a problem, I'd
be interested in hearing it.
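
For comparison, the regex count can be checked against str.split(),
which splits on runs of whitespace when called with no arguments. The
sketch below assumes Python 3; the function names and the sample text
are made up for illustration. On ordinary ASCII text the two counts
should agree.

```python
import re


def count_with_regex(text):
    """Count maximal runs of non-whitespace characters."""
    return len(re.findall(r"\S+", text))


def count_with_split(text):
    """Count the pieces produced by splitting on runs of whitespace."""
    return len(text.split())


sample = "  The quick\tbrown fox\n\njumps over the lazy dog.  \n"
print(count_with_regex(sample), count_with_split(sample))  # 9 9
```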

Like I said, I love using these UNIX-type programs to learn a new
language. It helps me learn things like file I/O, command line
arguments, string manipulation, etc.

Keith P. Boruff
 
