Homegrown wordcount

Discussion in 'Python' started by Thomas Philips, Jun 11, 2004.

  1. I've coded a little word counting routine that handles a reasonably
    wide range of inputs. How could it be made to cover more, though
    admittedly more remote, possibilities such as nested lists of lists,
    items for which the string representation is a string containing lists
    etc. etc. without significantly increasing the complexity of the
    program?

    Thomas Philips

    def wordcount(input):

        from string import whitespace

        #Treat iterable inputs differently
        if "__iter__" in dir(input):
            wordList =(" ".join([str(item) for item in input])).split()
        else:
            wordList = [str(input)]

        #Remove any words that are just whitespace
        for i,word in enumerate(wordList):
            while word and word[-1] in whitespace:
                word = word[:-1]
            wordList[i] = word
        wc = len(filter(None,wordList)) #Filter out any empty strings
        return wc
     
    Thomas Philips, Jun 11, 2004
    #1

  2. Thomas Philips

    David Wilson Guest

    On Fri, Jun 11, 2004 at 11:05:32AM -0700, Thomas Philips wrote:
    > I've coded a little word counting routine that handles a reasonably
    > wide range of inputs. How could it be made to cover more, though
    > admittedly more remote, possibilities such as nested lists of lists,
    > items for which the string representation is a string containing lists
    > etc. etc. without significantly increasing the complexity of the
    > program?


    Hello,

    Such 'magical' behaviour is error prone and causes many a headache when
    debugging. Some might think that even this is too much:

    > #Treat iterable inputs differently
    > if "__iter__" in dir(input):
    >     wordList =(" ".join([str(item) for item in input])).split()
    > else:
    >     wordList = [str(input)]


    Myself included. Rather than increasing the complexity of this
    function, why not write a few wrapper functions if you need them?
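
One way to read the wrapper suggestion, as a minimal sketch: keep the core counter strict about its input (strings only) and put any handling of collections in a separate helper. The names `count_words` and `count_words_in_items` are hypothetical, and the code is written for Python 3 rather than the Python 2 used elsewhere in this thread.

```python
def count_words(text):
    """Count whitespace-separated words in a single string."""
    if not isinstance(text, str):
        # Refuse anything that is not a string instead of guessing.
        raise TypeError("count_words expects a string, got %s" % type(text).__name__)
    return len(text.split())


def count_words_in_items(items):
    """Wrapper (hypothetical): count words across a flat iterable of strings."""
    return sum(count_words(item) for item in items)


print(count_words("This is a test"))
print(count_words_in_items(["This is a test", "one more"]))
```

With this split, a caller who passes the wrong kind of object gets an immediate `TypeError` rather than a silently odd count.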


    David.

    --
    "Science is what we understand well enough to explain to a
    computer. Art is everything else we do."
    -- Donald Knuth
     
    David Wilson, Jun 11, 2004
    #2

  3. Thomas Philips

    Larry Bates Guest

    Something like this?

    def wordcount(input, sep=" "):
        global words
        if isinstance(input, str):
            words+=len([x.strip() for x in input.split(sep)])
            return words
        else:
            for item in input:
                wordcount(item)

        return words

    #
    # Test with a string
    #
    words=0
    print wordcount("This is a test") # String test
    words=0
    print wordcount(["This is a test", "This is a test"]) # List test
    words=0
    print wordcount([["This is a test","This is a test"],
                     ["This is a test","This is a test"]]) # List of lists
    words=0
    data=[["this is a test"],["this", "is", "a", "test"],"This is a test"]
    print wordcount(data)

    HTH,
    Larry Bates
     
    Larry Bates, Jun 11, 2004
    #3
  4. An embarrassing mistake on my part: I should have typed
    #Treat iterable inputs differently
    if "__iter__" in dir(input):
        wordList =(" ".join([str(item) for item in input])).split()
    else:
        wordList = str(input).split()

    I wish I knew how to treat all possible inputs in a uniform fashion,
    but I'm nowhere near that yet, hence the question. That said, it
    addresses the situations that arise in practice fairly well, though I
    am sure it can be sped up substantially.
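
To make the effect of this correction concrete, here is a minimal sketch of just the non-iterable branch before and after the fix. The helper names `wordcount_before` and `wordcount_after` are hypothetical, and the code targets Python 3 rather than the Python 2 of the original post.

```python
def wordcount_before(value):
    # Original branch: the whole string representation becomes one "word".
    word_list = [str(value)]
    return len([w for w in word_list if w.strip()])


def wordcount_after(value):
    # Corrected branch: split the string representation on whitespace.
    return len(str(value).split())


print(wordcount_before("This is a test"))
print(wordcount_after("This is a test"))
```

Two side notes, as far as I can tell: `split()` with no arguments never returns empty or whitespace-only strings, so the stripping loop and the `filter` call in the original have nothing left to do; and on Python 3, `str` defines `__iter__`, so the `"__iter__" in dir(input)` check would send a plain string down the iterable branch and count its non-whitespace characters instead of its words.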

    Thomas Philips
     
    Thomas Philips, Jun 12, 2004
    #4
  5. Larry Bates wrote:
    > Something like this?
    >
    > def wordcount(input, sep=" "):
    >     global words
    >     if isinstance(input, str):
    >         words+=len([x.strip() for x in input.split(sep)])


    What's the purpose of stripping the items in the list if you just count
    their number? Isn't this equivalent to

        words += len(input.split(sep))

    >         return words
    >     else:
    >         for item in input:
    >             wordcount(item)
    >
    >     return words


    Removing the global statement and sep param, you get:

    def wordcount(input):
        if isinstance(input, str):
            return len(input.split())
        else:
            return sum([wordcount(item) for item in input])
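
The recursive version above assumes every leaf is a string. A possible extension, sketched here for Python 3 and not taken from the thread, also accepts non-iterable leaves such as numbers by falling back to their string representation.

```python
def wordcount(value):
    """Count words in a string or in arbitrarily nested iterables."""
    if isinstance(value, str):
        # Strings are iterable too, so handle them before the generic case;
        # otherwise the recursion would never bottom out.
        return len(value.split())
    try:
        items = iter(value)
    except TypeError:
        # Not iterable (e.g. an int): fall back to its string representation.
        return len(str(value).split())
    return sum(wordcount(item) for item in items)


data = [["this is a test"], ["this", "is", "a", "test"], "This is a test", 42]
print(wordcount(data))
```

Checking for `str` first is the important design choice: each character of a string is itself a one-character string, so treating strings as generic iterables would recurse without end.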

    --
    Grégoire Dooms
     
    Grégoire Dooms, Jun 12, 2004
    #5
  6. Grégoire Dooms wrote:


    > What's the purpose of stripping the items in the list if you just count
    > their number? Isn't this equivalent to
    > words += len(input.split(sep))
    >
    >>         return words
    >>     else:
    >>         for item in input:
    >>             wordcount(item)
    >>
    >>     return words
    >
    > Removing the global statement and sep param, you get:
    >
    > def wordcount(input):
    >     if isinstance(input, str):
    >         return len(input.split())
    >     else:
    >         return sum([wordcount(item) for item in input])
    >
    > --
    > Grégoire Dooms


    After reading this thread, I decided to embark on a word counting
    program of my own. One thing I like to do when learning a new programming
    language is to try to emulate some of my favorite UNIX-type programs.

    That said, to get the count of words in a string, I merely did the
    following:


    # Beginning of program

    import re
    import sys

    # Right now my simple wc program just reads piped data
    if not sys.stdin.isatty():
        input_data = sys.stdin.read()
        # Count runs of non-whitespace characters
        print "number of words:", len(re.findall(r'\S+', input_data))

    # End of program

    Though I've only done trivial tests on this up to now, the word count of
    this script seems to match that of the wc on my system (RH Linux WS). I
    ran some big RFC text files through this too.

    There could be some flaws here; I don't know. I'll have to look at it
    better when I get back from the gym. If anyone here finds a problem, I'd
    be interested in hearing it.
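
As a quick cross-check of the regular-expression approach, the sketch below (Python 3, with hypothetical function names) compares it with `str.split()`, which was used earlier in the thread. For ordinary text the two should agree, though I have only checked that on simple samples like this one.

```python
import re


def count_words_regex(text):
    # Same idea as the program above: count runs of non-whitespace characters.
    return len(re.findall(r"\S+", text))


def count_words_split(text):
    # str.split() with no arguments splits on runs of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())


sample = "  The quick\tbrown fox\n\njumps over  the lazy dog.\n"
print(count_words_regex(sample), count_words_split(sample))
```

Both functions treat a "word" as any maximal run of non-whitespace characters, which is also how `wc -w` is usually described.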

    Like I said, I love using these UNIX-type programs to learn a new
    language. It helps me learn things like file I/O, command-line
    arguments, string manipulation, etc.

    Keith P. Boruff
     
    Keith P. Boruff, Jun 13, 2004
    #6
