How do I skip over multiple words in a file?

Discussion in 'Python' started by chad, Nov 11, 2010.

  1. chad

    chad Guest

    Let's say that I have an article. What I want to do is read in this
    file and have the program skip over every instance of the words "the",
    "and", "or", and "but". What would be the general strategy for
    attacking a problem like this?
    chad, Nov 11, 2010
    #1

  2. Tim Chase

    Tim Chase Guest

    On 11/11/10 09:07, chad wrote:
    > Let's say that I have an article. What I want to do is read in
    > this file and have the program skip over ever instance of the
    > words "the", "and", "or", and "but". What would be the
    > general strategy for attacking a problem like this?


    I'd keep a file of "stop words" and read them into a set
    (normalizing case in the process). Then, as I skim over each
    word in the target file, I'd check whether the case-normalized
    version of the word is in the stop-word set and skip it if it is.
    It might look something like this:

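    def normalize_word(s):
        # Strip surrounding whitespace and fold case so lookups are case-insensitive.
        return s.strip().upper()

    # stop_words.txt holds one stop word per line.
    with open('stop_words.txt') as f:
        stop_words = set(normalize_word(word) for word in f)

    with open('data.txt') as f:
        for line in f:
            for word in line.split():
                if normalize_word(word) in stop_words:
                    continue
                process(word)  # placeholder for whatever you do with the kept words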

    -tkc
    Tim Chase, Nov 11, 2010
    #2

  3. r0g

    r0g Guest

    On 11/11/10 15:07, chad wrote:
    > Let's say that I have an article. What I want to do is read in this
    > file and have the program skip over ever instance of the words "the",
    > "and", "or", and "but". What would be the general strategy for
    > attacking a problem like this?



    If your files are not too big, I'd simply read them into a string and do
    a string replace for each word you want to skip. If you want case
    insensitivity, use re.sub() with the re.IGNORECASE flag instead of the
    default str.replace() method. Neither is elegant or all that efficient,
    but both are very easy. If your use case requires something high
    performance, then best keep looking :)
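
A minimal sketch of the read-and-replace idea, assuming the whole article fits in memory. The regular-expression substitution function in the standard library is re.sub(). The helper name remove_stop_words and the sample sentence are made up for illustration. The \b word-boundary anchors matter: a plain substring replace of "or" would also damage words such as "word".

```python
import re

# Word list taken from the original question.
STOP_WORDS = ("the", "and", "or", "but")

# \b keeps "or" from matching inside "word"; IGNORECASE also catches "The".
_STOP_RE = re.compile(
    r"\b(?:%s)\b" % "|".join(map(re.escape, STOP_WORDS)),
    re.IGNORECASE,
)

def remove_stop_words(text):
    """Return text with every whole-word stop word removed (hypothetical helper)."""
    stripped = _STOP_RE.sub("", text)
    # Collapse the runs of spaces left behind by the removals.
    return re.sub(r"[ \t]{2,}", " ", stripped).strip()

print(remove_stop_words("The cat and the dog or a word but not more"))
```

This only deletes the words from the text; if each remaining word has to be processed individually, one of the word-by-word approaches in the other replies is probably a better fit.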

    Roger.
    r0g, Nov 11, 2010
    #3
  4. Paul Watson

    Paul Watson Guest

    On 2010-11-11 08:07, chad wrote:
    > Let's say that I have an article. What I want to do is read in this
    > file and have the program skip over ever instance of the words "the",
    > "and", "or", and "but". What would be the general strategy for
    > attacking a problem like this?


    I realize that you may need or want to do this in Python. This would be
    trivial in an awk script.
    Paul Watson, Nov 11, 2010
    #4
  5. Paul Rubin

    Paul Rubin Guest

    chad writes:

    > Let's say that I have an article. What I want to do is read in this
    > file and have the program skip over ever instance of the words "the",
    > "and", "or", and "but". What would be the general strategy for
    > attacking a problem like this?


    Something like (untested):

    stopwords = set(('the', 'and', 'or', 'but'))

    def goodwords(f):
        # f is an open file (or any other iterable of lines)
        for line in f:
            for w in line.split():
                if w.lower() not in stopwords:
                    yield w

    Removing punctuation is left as an exercise.
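
One possible take on the punctuation exercise, offered as a sketch rather than the poster's own solution: strip leading and trailing punctuation from each whitespace-separated token before the stop-word lookup. The name good_words and its parameters are assumptions for illustration. Note that it yields the stripped token, and that internal apostrophes (as in "Let's") are left alone because str.strip() only touches the ends.

```python
import string

# Word list taken from the original question.
STOP_WORDS = frozenset(("the", "and", "or", "but"))

def good_words(lines, stop_words=STOP_WORDS):
    """Yield tokens that are not stop words (hypothetical helper).

    lines is any iterable of strings, e.g. an open file.
    """
    for line in lines:
        for token in line.split():
            # Drop surrounding quotes, commas, full stops and so on.
            core = token.strip(string.punctuation)
            # Skip tokens that were pure punctuation as well as stop words.
            if core and core.lower() not in stop_words:
                yield core

sample = ['The words "the", "and", "or", and "but" are skipped.']
print(list(good_words(sample)))
```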
    Paul Rubin, Nov 11, 2010
    #5
  6. On 11.11.2010 21:33, Paul Watson wrote:
    > On 2010-11-11 08:07, chad wrote:
    >> Let's say that I have an article. What I want to do is read in this
    >> file and have the program skip over ever instance of the words "the",
    >> "and", "or", and "but". What would be the general strategy for
    >> attacking a problem like this?

    >
    > I realize that you may need or want to do this in Python. This would
    > be trivial in an awk script.

    There are several ways to do this.

    skip = ('and', 'or', 'but')
    words = []  # renamed from "all" to avoid shadowing the builtin
    for l in open('some.txt'):
        words.extend(w for w in l.split() if w not in skip)
    print(words)

    If some.txt contains your original question, it returns this:
    ["Let's", 'say', 'that', 'I', 'have', 'an', 'article.', 'What', 'I',
    'want', 'to', 'do', 'is', 'read', 'in', 'this', 'file', 'have', 'the',
    'program', 'skip', 'over', 'ever', 'instance', 'of', 'the', 'words',
    '"the",', '"and",', '"or",', '"but".', 'What', 'would', 'be', 'the',
    'general', 'strategy', 'for', 'attacking', 'a', 'problem', 'like', 'this?']

    But this is just _one_ way to get there.
    Faster solutions could be based on a regex:
    import re
    skip = ('and', 'or', 'but')
    word_re = re.compile(r'(\w+)')  # raw string; renamed from "all" to avoid shadowing the builtin
    print([w for w in word_re.findall(open('some.txt').read()) if w not in skip])

    This gives the following result (you lose some punctuation etc.):
    ['Let', 's', 'say', 'that', 'I', 'have', 'an', 'article', 'What', 'I',
    'want', 'to', 'do', 'is', 'read', 'in', 'this', 'file', 'have', 'the',
    'program', 'skip', 'over', 'ever', 'instance', 'of', 'the', 'words',
    'the', 'What', 'would', 'be', 'the', 'general', 'strategy', 'for',
    'attacking', 'a', 'problem', 'like', 'this']

    But there are so many ways to do it ...
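
A hedged variation on the regex approach, not the poster's code: the pattern [\w']+ keeps apostrophes inside a token, so "Let's" survives in one piece, and comparing the lowercased token against a set makes the check case-insensitive. The names keep_words and TOKEN_RE are invented for this sketch, and the stop-word list here includes "the" as in the original question. One known limitation is that a word wrapped in single quotes keeps its quotes and would not be recognised as a stop word.

```python
import re

# Word list taken from the original question.
STOP_WORDS = frozenset(("the", "and", "or", "but"))

# Runs of word characters and apostrophes; other punctuation separates tokens.
TOKEN_RE = re.compile(r"[\w']+")

def keep_words(text, stop_words=STOP_WORDS):
    """Return the tokens of text that are not stop words (hypothetical helper)."""
    return [w for w in TOKEN_RE.findall(text) if w.lower() not in stop_words]

sample = 'Let\'s say the words "the", "and", "or", and "but" are skipped. The end.'
print(keep_words(sample))
```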
    Stefan Sonnenberg-Carstens, Nov 11, 2010
    #6
