Replace stop words (remove words from a string)

Discussion in 'Python' started by BerlinBrown, Jan 17, 2008.

  1. BerlinBrown

    BerlinBrown Guest

    If I have an array of "stop" words and I want to replace those values
    with something else in a string, how would I go about doing this? I
    have code that splits the string and then takes a set difference, but I
    think there is an easier approach:

    E.g.

    mystr = "kljsldkfjksjdfjsdjflkdjslkf[BAD]Kkjkkkkjkkjk[BAD]LSKJFKSFJKSJF;L[BAD2]kjsldfsd;"

    Given stop_list = [ "[BAD]", "[BAD2]" ], I want to replace every
    occurrence of those values with a zero-length string.

    I had this before, but I don't want to use this approach because I don't
    want to use split.

    line_list = line.lower().split()
    res = list(set(keywords_list).difference(set(ENTITY_IGNORE_LIST)))
     
    BerlinBrown, Jan 17, 2008
    #1

  2. BerlinBrown

    Karthik Guest

    How about -

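    import string

    for s in stoplist:
        # Python 2 only: string.replace() was removed from the string module in Python 3.
        # It returns a new string, so the result has to be assigned back.
        mystr = string.replace(mystr, s, "")

    I hope this works.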

     
    Karthik, Jan 17, 2008
    #2

  3. BerlinBrown

    Gary Herron Guest

    Strings have a replace method that produces a new string with all
    occurrences of one substring replaced by another. You'd have to loop
    through your stop_list one word at a time.
    'abcabc'
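
    A minimal sketch of the behaviour described here. The sample input is an
    assumption, chosen only so that the call yields the same result as the
    output shown; it is not the string from the original post.

```python
# Hypothetical sample input; the string used in the original example is not known.
original = "abcXYZabcXYZ"

# str.replace returns a new string with every occurrence of the substring removed.
cleaned = original.replace("XYZ", "")
print(repr(cleaned))   # 'abcabc'

# Strings are immutable, so the original value is left untouched.
print(repr(original))  # 'abcXYZabcXYZ'
```

    Because the method never modifies the string in place, the return value
    has to be kept, which matters when the call sits inside a loop.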


    If either the string or the stop_list grows particularly large, this
    approach won't scale very well, since the whole string is re-created
    for each stop_list entry. In that case, I'd look into the regular
    expression (re) module. You may be able to finagle a way to find and
    replace all stop_list entries in one pass. (Finding them all is easy --
    I'm not so sure you could replace them all at once, though.)


    Gary Herron
     
    Gary Herron, Jan 17, 2008
    #3
  4. BerlinBrown

    Gary Herron Guest

    That will work, but the string module is long outdated. Better to use
    string methods:

    for s in stoplist:
        # replace() returns a new string, so rebind the name to keep the result
        mystr = mystr.replace(s, "")

    Gary Herron

     
    Gary Herron, Jan 17, 2008
    #4
  5. Regular expressions should do the trick.

    Try this:
    'kljsldkfjksjdfjsdjflkdjslkfKkjkkkkjkkjkLSKJFKSFJKSJF;Lkjsldfsd;'
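
    The code that went with this post is not preserved here. One way to do
    the replacement in a single pass with the re module might look like the
    sketch below. It assumes the mystr and stop_list values from the original
    question; sorting longest-first is an added precaution, not something the
    thread asks for.

```python
import re

mystr = "kljsldkfjksjdfjsdjflkdjslkf[BAD]Kkjkkkkjkkjk[BAD]LSKJFKSFJKSJF;L[BAD2]kjsldfsd;"
stop_list = ["[BAD]", "[BAD2]"]

# re.escape makes "[" and "]" match literally instead of opening a character class.
# Longer entries come first so a short entry cannot shadow a longer one that
# starts with the same characters.
alternatives = sorted(stop_list, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(word) for word in alternatives))

# A single sub() call scans the string once and drops every match.
result = pattern.sub("", mystr)
print(repr(result))
```

    Compiling the pattern once also helps when the same stop list is applied
    to many strings.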

    Raymond
     
    Raymond Hettinger, Jan 17, 2008
    #5
  6. BerlinBrown wrote:
    res = mystr
    for stop_word in stop_list:
        res = res.replace(stop_word, '')
     
    Bruno Desthuilliers, Jan 17, 2008
    #6
  7. Raymond Hettinger:
    If the stop words are many (and similar) then that RE can be optimized
    with a trie-based strategy, like this one called "List":
    http://search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Regexp/List.pm

    "List" is used by something more complex called "Optimizer" that's
    overkill for the OP's problem:
    http://search.cpan.org/~dankogai/Regexp-Optimizer-0.15/lib/Regexp/Optimizer.pm

    I don't know if a Python module similar to "List" is available; I may
    write one :)
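
    A rough Python sketch of the trie idea, for illustration only. It is not
    a port of Regexp::List, and the function names (build_trie,
    trie_to_pattern, remove_with_trie) are hypothetical. Words that share a
    prefix are merged so the regex engine tests each shared character once.

```python
import re

def build_trie(words):
    """Store the words in a nested dict; the key "" marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return trie

def trie_to_pattern(node):
    """Turn a trie node into a regex fragment matching any suffix stored below it."""
    is_word_end = "" in node
    branches = [re.escape(ch) + trie_to_pattern(child)
                for ch, child in sorted(node.items()) if ch != ""]
    if not branches:
        return ""
    if len(branches) == 1 and not is_word_end:
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    # If a word may also end at this node, the remaining suffix is optional.
    # The "?" is greedy, so the longer word is preferred when both could match.
    return group + "?" if is_word_end else group

def remove_with_trie(text, words):
    """Remove every occurrence of the given words from text in one pass."""
    pattern = re.compile(trie_to_pattern(build_trie(words)))
    return pattern.sub("", text)

print(trie_to_pattern(build_trie(["[BAD]", "[BAD2]"])))
print(remove_with_trie("foo[BAD]bar[BAD2]baz", ["[BAD]", "[BAD2]"]))
```

    For two stop words the gain over a plain alternation is negligible; the
    approach only starts to matter with many similar entries.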

    Bye,
    bearophile
     
    bearophileHUGS, Jan 17, 2008
    #7