How to find all the same words in a text?

Discussion in 'Python' started by Johny, Feb 10, 2007.

  1. Johny

    Johny Guest

    I need to find all occurrences of the same word in a text.
    What would be the best way to do that?
    I used string.find, but it does not work properly for whole words.
    Suppose I want to find the number 324 in the text

    '45 324 45324'

    There is only one occurrence of the word 324, but string.find() finds 2
    occurrences (it also matches inside 45324).

    Must I use a regex?
    Thanks for any help.
    L.
     
    Johny, Feb 10, 2007
    #1

  2. Johny

    Marco Giusti Guest

    On Sat, Feb 10, 2007 at 05:29:23AM -0800, Johny wrote:
    >I need to find all the same words in a text .
    >What would be the best idea to do that?
    >I used string.find but it does not work properly for the words.
    >Let suppose I want to find a number 324 in the text
    >
    >'45 324 45324'
    >
    >there is only one occurrence of 324 word but string.find() finds 2
    >occurrences ( in 45324 too)


    >>> '45 324 45324'.split().count('324')

    1
    >>>
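To see why the two approaches disagree, here is a small sketch (variable names are only illustrative). str.count, like str.find, matches the substring anywhere in the text, while split() first breaks the text on whitespace so that only whole tokens are compared.

```python
text = '45 324 45324'

# Substring search: also matches the '324' inside '45324'.
substring_hits = text.count('324')

# Token comparison: split() breaks on whitespace, so only whole tokens match.
word_hits = text.split().count('324')

print(substring_hits, word_hits)
```

Note that this treats a "word" as anything between whitespace, so punctuation attached to a word would make it a different token.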


    ciao
    marco

     
    Marco Giusti, Feb 10, 2007
    #2

  3. Johny

    Johny Guest

    On Feb 10, 2:42 pm, Marco Giusti <> wrote:
    > On Sat, Feb 10, 2007 at 05:29:23AM -0800, Johny wrote:
    > >I need to find all the same words in a text .
    > >What would be the best idea to do that?
    > >I used string.find but it does not work properly for the words.
    > >Let suppose I want to find a number 324 in the text

    >
    > >'45 324 45324'

    >
    > >there is only one occurrence of 324 word but string.find() finds 2
    > >occurrences ( in 45324 too)

    >
    > >>> '45 324 45324'.split().count('324')

    > 1
    > >>>

    >
    > ciao

    Marco,
    Thank you for your help.
    It works perfectly, but I forgot to say that I also need to find the
    position of each occurrence of the word. Is that possible?
    Thanks.
    L
     
    Johny, Feb 10, 2007
    #3
  4. Johny

    ZeD Guest

    Johny wrote:

    >> >Let suppose I want to find a number 324 in the text

    >>
    >> >'45 324 45324'

    >>
    >> >there is only one occurrence of 324 word but string.find() finds 2
    >> >occurrences ( in 45324 too)

    >>
    >> >>> '45 324 45324'.split().count('324')

    >> 1
    >> >>>

    >>
    >> ciao

    > Marco,
    > Thank you for your help.
    > It works perfectly, but I forgot to say that I also need to find the
    > position of each occurrence of the word. Is that possible?


    >>> [i for i, e in enumerate('45 324 45324'.split()) if e=='324']

    [1]
    >>>
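The result above is the index in the list of words (1 means the second word). If the character offset in the original string is wanted instead, one possible sketch is to let re.finditer produce the same whitespace-delimited tokens together with their start positions. The names text, target and offsets are assumptions for illustration.

```python
import re

text = '45 324 45324'
target = '324'

# \S+ matches each run of non-whitespace characters, i.e. the same tokens
# that str.split() returns, but finditer also reports where each one starts.
offsets = [m.start() for m in re.finditer(r'\S+', text) if m.group() == target]

print(offsets)
```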


    --
    Under construction
     
    ZeD, Feb 10, 2007
    #4
  5. Johny

    Marco Giusti Guest

    On Sat, Feb 10, 2007 at 06:00:05AM -0800, Johny wrote:
    >On Feb 10, 2:42 pm, Marco Giusti <> wrote:
    >> On Sat, Feb 10, 2007 at 05:29:23AM -0800, Johny wrote:
    >> >I need to find all the same words in a text .
    >> >What would be the best idea to do that?
    >> >I used string.find but it does not work properly for the words.
    >> >Let suppose I want to find a number 324 in the text

    >>
    >> >'45 324 45324'

    >>
    >> >there is only one occurrence of 324 word but string.find() finds 2
    >> >occurrences ( in 45324 too)

    >>
    >> >>> '45 324 45324'.split().count('324')

    >> 1
    >> >>>

    >>
    >> ciao

    >Marco,
    >Thank you for your help.
    >It works perfectly, but I forgot to say that I also need to find the
    >position of each occurrence of the word. Is that possible?


    >>> li = '45 324 45324'.split()
    >>> li.index('324')

    1
    >>>


    play with count and index and take a look at the help of both
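As a hint for that experiment: list.index only reports the first match, but it accepts an optional start argument, so it can be called repeatedly. A minimal sketch, where all_indexes is a hypothetical helper name and not part of the standard library:

```python
def all_indexes(items, value):
    """Return every position of value in items, using list.index repeatedly."""
    positions = []
    start = 0
    while True:
        try:
            pos = items.index(value, start)  # search from 'start' onwards
        except ValueError:
            break  # no further occurrence
        positions.append(pos)
        start = pos + 1
    return positions

words = '324 45 324 45324'.split()
print(words.count('324'), all_indexes(words, '324'))
```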

    ciao
    marco

     
    Marco Giusti, Feb 10, 2007
    #5
  6. * Johny (10 Feb 2007 05:29:23 -0800)
    > I need to find all the same words in a text .
    > What would be the best idea to do that?
    > I used string.find but it does not work properly for the words.
    > Let suppose I want to find a number 324 in the text
    >
    > '45 324 45324'
    >
    > there is only one occurrence of 324 word but string.find() finds 2
    > occurrences ( in 45324 too)
    >
    > Must I use regex?


    There are two approaches: one is the "solve once and forget" approach
    where you code around this particular problem. Marco showed you one
    solution for this.

    The other approach would be to realise that your problem is a specific
    case of two general problems: partitioning a sequence by a separator
    and partitioning a sequence into equivalence classes. The bonus for this
    approach is that you will have a /lot/ of problems that can be solved
    with either one of these utils or a combination of them.

    >>> a = '45 324 45324'
    >>> quotient_set(part(a, [' ', ' '], 'sep'), ident)
    {'324': ['324'], '45': ['45'], '45324': ['45324']}

    The latter approach is much more flexible. Just imagine your problem
    changes to a string that's separated by newlines (instead of spaces)
    and you want to find words that start with the same character (instead
    of requiring the words to be identical).
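The part and quotient_set utilities are not shown in this thread, so the following is only a stand-in sketch of the idea, not Thorsten's code: str.split does the partitioning by separator, and a hypothetical group_by helper builds the equivalence classes from a key function.

```python
from collections import defaultdict

def group_by(items, key):
    """Partition items into equivalence classes: items with equal key(item) share a list."""
    classes = defaultdict(list)
    for item in items:
        classes[key(item)].append(item)
    return dict(classes)

# Original problem: space-separated, words are equivalent when identical.
same_words = group_by('45 324 45324 324'.split(' '), lambda w: w)

# Changed problem: newline-separated, words are equivalent when they
# share a first character.
same_initial = group_by('45\n324\n45324\n324'.split('\n'), lambda w: w[0])

print(same_words)
print(same_initial)
```

Only the separator and the key function change between the two cases, which is the flexibility the general approach is after.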


    Thorsten
     
    Thorsten Kampe, Feb 10, 2007
    #6
  7. "Johny" <> on 10 Feb 2007 05:29:23 -0800 didst step
    forth and proclaim thus:

    > I need to find all the same words in a text .
    > What would be the best idea to do that?


    I make no claims of this being the best approach:

    ====================
    import re

    def findOccurances(a_string, word):
        """
        Given a string and a word, returns a pair:
        [0] = count, [1] = list of indexes where word occurs
        """
        count = 0
        indexes = []
        start = 0  # offset of the current slice within the original string
        # re.escape stops characters such as '.' in word acting as regex operators
        pattern = re.compile(r'\b%s\b' % re.escape(word), re.I)

        while True:
            match = pattern.search(a_string)
            if not match:
                break
            count += 1
            indexes.append(match.start() + start)
            start += match.end()
            a_string = a_string[match.end():]

        return (count, indexes)
    ====================

    Seems to work for me. No guarantees.

    --
    Sam Peterson
    skpeterson At nospam ucdavis.edu
    "if programmers were paid to remove code instead of adding it,
    software would be much better" -- unknown
     
    Samuel Karl Peterson, Feb 11, 2007
    #7
  8. Johny

    Neil Cerutti Guest

    On 2007-02-10, Johny <> wrote:
    > I need to find all the same words in a text .
    > What would be the best idea to do that?
    > I used string.find but it does not work properly for the words.
    > Let suppose I want to find a number 324 in the text
    >
    > '45 324 45324'
    >
    > there is only one occurrence of 324 word but string.find() finds 2
    > occurrences ( in 45324 too)
    >
    > Must I use regex?
    > Thanks for help


    The first thing to do is to answer the question: What is a word?

    The second thing to do is to design some code that can find
    words in strings.

    The last thing to do is to search those actual words for the word
    you're looking for.
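A minimal sketch of those three steps, under the assumption that a word is a maximal run of letters, digits or underscores (the regex \w+). The names WORD, find_words and positions_of are made up for illustration; a different answer to "what is a word?" only changes the first step.

```python
import re

# Step 1: decide what a word is. Assumption: a run of \w characters.
WORD = re.compile(r'\w+')

def find_words(text):
    """Step 2: return (offset, word) pairs for every word in text."""
    return [(m.start(), m.group()) for m in WORD.finditer(text)]

def positions_of(text, target):
    """Step 3: search the extracted words for the target."""
    return [offset for offset, word in find_words(text) if word == target]

print(positions_of('45 324, 45324; 324.', '324'))
```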

    --
    Neil Cerutti
     
    Neil Cerutti, Feb 11, 2007
    #8
  9. In order to find all the words in a text, you need to tokenize it first.
    The rest is a matter of calling the count method on the list of
    tokenized words. For tokenization look here:
    http://nltk.sourceforge.net/lite/doc/en/words.html
    A word of warning: depending on what exactly you need to do, the
    seemingly trivial task of tokenizing a text can become quite complex.
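As a rough illustration using only the standard library (not NLTK), the sketch below tokenizes with a regex and counts with collections.Counter. The tokenizer rule here is an assumption: lowercase everything and treat runs of letters, optionally joined by apostrophes, as words. It also shows how such choices change the counts: "cat" and "cat's" end up as different tokens.

```python
import re
from collections import Counter

text = "The cat sat. The cat's hat sat on the mat."

# Assumed tokenizer: lowercase the text, then keep runs of letters,
# allowing an internal apostrophe as in "cat's".
tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
counts = Counter(tokens)

print(counts['the'], counts['cat'], counts["cat's"])
```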

    Enjoy,

    Maël

    Neil Cerutti schrieb:
    > On 2007-02-10, Johny <> wrote:
    >> I need to find all the same words in a text .
    >> What would be the best idea to do that?
    >> I used string.find but it does not work properly for the words.
    >> Let suppose I want to find a number 324 in the text
    >>
    >> '45 324 45324'
    >>
    >> there is only one occurrence of 324 word but string.find() finds 2
    >> occurrences ( in 45324 too)
    >>
    >> Must I use regex?
    >> Thanks for help

    >
    > The first thing to do is to answer the question: What is a word?
    >
    > The second thing to do is to design some code that can find
    > words in strings.
    >
    > The last thing to do is to search those actual words for the word
    > you're looking for.
    >
     
    Maël Benjamin Mettler, Feb 11, 2007
    #9
  10. Johny

    Guest

    On Feb 11, 5:13 am, Samuel Karl Peterson
    <> wrote:
    > "Johny" <> on 10 Feb 2007 05:29:23 -0800 didst step
    > forth and proclaim thus:
    >
    > > I need to find all the same words in a text .
    > > What would be the best idea to do that?

    >
    > I make no claims of this being the best approach:
    >
    > ====================
    > def findOccurances(a_string, word):
    >     """
    >     Given a string and a word, returns a double:
    >     [0] = count [1] = list of indexes where word occurs
    >     """
    >     import re
    >     count = 0
    >     indexes = []
    >     start = 0 # offset for successive passes
    >     pattern = re.compile(r'\b%s\b' % word, re.I)
    >
    >     while True:
    >         match = pattern.search(a_string)
    >         if not match: break
    >         count += 1;
    >         indexes.append(match.start() + start)
    >         start += match.end()
    >         a_string = a_string[match.end():]
    >
    >     return (count, indexes)
    > ====================
    >
    > Seems to work for me. No guarantees.
    >




    More concisely:

    import re

    target_string = '45 324 45324'  # example text from the original post
    pattern = re.compile(r'\b324\b')
    indices = [match.start() for match in pattern.finditer(target_string)]
    print("Indices:", indices)
    print("Count:", len(indices))
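The same idea can be wrapped in a function for an arbitrary search word. This is only a sketch; word_positions is a hypothetical name, and re.escape is added so that a word containing regex metacharacters such as '.' is matched literally.

```python
import re

def word_positions(text, word):
    """Return the start offsets of whole-word occurrences of word in text."""
    # re.escape keeps characters such as '.' or '+' in word from acting
    # as regex operators.
    pattern = re.compile(r'\b%s\b' % re.escape(word))
    return [m.start() for m in pattern.finditer(text)]

print(word_positions('45 324 45324 324', '324'))
print(word_positions('3.24 3x24', '3.24'))
```

One caveat: \b marks a boundary between a \w character and a non-\w character, so this may not behave as expected for search words that begin or end with punctuation.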

    --
    Cheers,
    Steven
     
    , Feb 11, 2007
    #10
  11. on 11 Feb 2007 08:16:11 -0800 didst step
    forth and proclaim thus:

    > More concisely:
    >
    > import re
    >
    > pattern = re.compile(r'\b324\b')
    > indices = [ match.start() for match in
    > pattern.finditer(target_string) ]
    > print "Indices", indices
    > print "Count: ", len(indices)
    >


    Thank you, this is educational. I didn't realize that finditer
    returned match objects instead of tuples.

    > Cheers,
    > Steven
    >


    --
    Sam Peterson
    skpeterson At nospam ucdavis.edu
    "if programmers were paid to remove code instead of adding it,
    software would be much better" -- unknown
     
    Samuel Karl Peterson, Feb 12, 2007
    #11