Homegrown wordcount

Thomas Philips · Jun 11, 2004

I've coded a little word counting routine that handles a reasonably
wide range of inputs. How could it be made to cover more, though
admittedly more remote, possibilities such as nested lists of lists,
items for which the string representation is a string containing lists
etc. etc. without significantly increasing the complexity of the
program?

Thomas Philips

    def wordcount(input):

        from string import whitespace

        #Treat iterable inputs differently
        if "__iter__" in dir(input):
            wordList =(" ".join([str(item) for item in input])).split()
        else:
            wordList = [str(input)]

        #Remove any words that are just whitespace
        for i,word in enumerate(wordList):
            while word and word[-1] in whitespace:
                word = word[:-1]
            wordList[i] = word   #Store the stripped word back in the list
        wc = len(filter(None,wordList)) #Filter out any empty strings
        return wc

David Wilson · Jun 11, 2004

Thomas Philips wrote:

I've coded a little word counting routine that handles a reasonably
wide range of inputs. How could it be made to cover more, though
admittedly more remote, possibilities such as nested lists of lists,
items for which the string representation is a string containing lists
etc. etc. without significantly increasing the complexity of the
program?

Hello,

Such 'magical' behaviour is error prone and causes many a headache when
debugging. Some might think that even this is too much:

    #Treat iterable inputs differently
    if "__iter__" in dir(input):
        wordList =(" ".join([str(item) for item in input])).split()
    else:
        wordList = [str(input)]

Myself included. Instead of increasing the complexity of this
function, why not write a few wrapper functions if you have the need?
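
One way to read this suggestion, as a rough sketch rather than anything posted in the thread: keep one small function per input shape and let the caller pick the one that matches. The function names below are hypothetical, and the code assumes Python 3.

```python
# Hypothetical helper names; none of these appear in the thread.

def count_words(text):
    """Count whitespace-separated words in a single string."""
    return len(text.split())


def count_words_in_items(items):
    """Count words across a flat iterable of strings."""
    return sum(count_words(item) for item in items)


def count_words_in_nested(rows):
    """Count words in an iterable of iterables of strings (one nesting level)."""
    return sum(count_words_in_items(row) for row in rows)


print(count_words("This is a test"))
print(count_words_in_items(["This is a test", "and another"]))
print(count_words_in_nested([["one two"], ["three", "four five"]]))
```

Each wrapper does one thing, so a wrong input shape fails loudly instead of being silently coerced.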

David.

Larry Bates · Jun 11, 2004

Something like this?

    def wordcount(input, sep=" "):
        global words
        if isinstance(input, str):
            words+=len([x.strip() for x in input.split(sep)])
            return words
        else:
            for item in input:
                wordcount(item)

        return words

    #
    # Test with a string
    #
    words=0
    print wordcount("This is a test") # String test
    words=0
    print wordcount(["This is a test", "This is a test"]) # List test
    words=0
    print wordcount([["This is a test","This is a test"],
                     ["This is a test","This is a test"]]) # List of lists
    words=0
    data=[["this is a test"],["this", "is", "a", "test"],"This is a test"]
    print wordcount(data)

HTH,
Larry Bates

Thomas Philips · Jun 12, 2004

An embarrassing mistake on my part: I should have typed

    #Treat iterable inputs differently
    if "__iter__" in dir(input):
        wordList =(" ".join([str(item) for item in input])).split()
    else:
        wordList = str(input).split()

I wish I knew how to treat all possible inputs in a uniform fashion,
but I'm nowhere near there as yet, hence the question. That said, it
addresses the situations that arise in practice fairly well, though I
am sure it can be sped up substantially.
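
For readers on Python 3, a hedged sketch of the corrected routine follows. It is an adaptation, not the code as posted: in Python 3 a `str` does have an `__iter__` attribute, so the `"__iter__" in dir(input)` test would send plain strings down the iterable branch. The sketch therefore assumes an explicit check that treats `str` and `bytes` as single values.

```python
from collections.abc import Iterable


def wordcount(value):
    """Flat word count: strings are split, other iterables are joined first."""
    # Assumption: str and bytes count as single values, not as iterables.
    if isinstance(value, Iterable) and not isinstance(value, (str, bytes)):
        text = " ".join(str(item) for item in value)
    else:
        text = str(value)
    # split() with no argument drops all whitespace, so no cleanup pass is needed.
    return len(text.split())


print(wordcount("This is a test"))
print(wordcount(["This is a test", "two words"]))
print(wordcount(42))
```

Nested lists still pass through `str()` here, so this remains a flat count rather than a uniform treatment of all inputs.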

Thomas Philips

Grégoire Dooms · Jun 12, 2004

Larry said:
Something like this?

    def wordcount(input, sep=" "):
        global words
        if isinstance(input, str):
            words+=len([x.strip() for x in input.split(sep)])

What's the purpose of stripping the items in the list if you just count
their number? Isn't this equivalent to

    words += len(input.split(sep))

            return words
        else:
            for item in input:
                wordcount(item)

        return words

Removing the global statement and sep param, you get:

    def wordcount(input):
        if isinstance(input, str):
            return len(input.split())
        else:
            return sum([wordcount(item) for item in input])
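
Dropping the `sep` parameter is not purely cosmetic: `split(" ")` splits on every single space and keeps empty strings, while `split()` treats any run of whitespace as one separator. A small sketch, assuming Python 3 and using `data` as a stand-in for the `input` parameter name, shows the difference and runs the recursive version on Larry's last test case:

```python
def wordcount(data):
    """Recursively count whitespace-separated words in nested iterables of strings."""
    if isinstance(data, str):
        return len(data.split())
    return sum(wordcount(item) for item in data)


text = "two  spaces"          # note the double space
print(len(text.split(" ")))   # 3: the empty string between the spaces is counted
print(len(text.split()))      # 2: runs of whitespace act as one separator

data = [["this is a test"], ["this", "is", "a", "test"], "This is a test"]
print(wordcount(data))        # 12
```

Stripping each piece, as in the quoted version, changes the pieces but not how many there are, so it has no effect on the count.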

Keith P. Boruff · Jun 13, 2004

Grégoire Dooms wrote:

What's the purpose of stripping the items in the list if you just count
their number? Isn't this equivalent to

    words += len(input.split(sep))

            return words
        else:
            for item in input:
                wordcount(item)

        return words

Removing the global statement and sep param, you get:

    def wordcount(input):
        if isinstance(input, str):
            return len(input.split())
        else:
            return sum([wordcount(item) for item in input])

After reading this thread, I decided to embark on a word counting
program of my own. One thing I like to do when learning new programming
languages is to try and emulate some of my favorite UNIX type programs.

That said, to get the count of words in a string, I merely did the
following:

    # Beginning of program

    import re
    import sys   # needed for sys.stdin

    # Right now my simple wc program just reads piped data
    if not sys.stdin.isatty():
        input_data = sys.stdin.read()
        # Each run of non-whitespace characters counts as one word
        print "number of words:", len(re.findall(r'[^\s]+', input_data))

    # End of program

Though I've only done trivial tests on this up to now, the word count of
this script seems to match that of the wc on my system (RH Linux WS). I
ran some big RFC text files through this too.

There could be some flaws here; I don't know. I'll have to look at it
better when I get back from the gym. If anyone here finds a problem, I'd
be interested in hearing it.
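
One cheap cross-check, offered as a sketch and not as a claim about how `wc` itself is implemented: count the same text with the regular expression and with plain `str.split()` and compare. The function names are hypothetical, the code assumes Python 3, and `\S+` is used as the shorter spelling of `[^\s]+`.

```python
import re


def count_words_regex(text):
    """Count maximal runs of non-whitespace characters."""
    return len(re.findall(r"\S+", text))


def count_words_split(text):
    """Count the pieces produced by whitespace splitting."""
    return len(text.split())


sample = "  Leading spaces,\ttabs\nand newlines\n\n"
print(count_words_regex(sample), count_words_split(sample))
```

If the two counts ever disagree on a real file, the differing input is a good place to start looking for edge cases.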

Like I said, I love using these UNIX type programs to learn a new
language. It helps me learn things like file I/O, command line
arguments, string manipulations.. etc.

Keith P. Boruff
