How to find all the same words in a text?

Johny · Feb 10, 2007

I need to find all occurrences of the same word in a text.
What would be the best way to do that?
I used string.find, but it does not work properly for whole words.
Suppose I want to find the number 324 in the text

'45 324 45324'

There is only one occurrence of the word 324, but string.find() finds 2
occurrences (it also matches inside 45324).

Must I use regex?
Thanks for help
L.
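
To make the problem concrete, here is a minimal sketch of the difference between substring matching and whole-word matching. It assumes that a "word" is anything separated by whitespace, which the question does not state explicitly.

```python
text = '45 324 45324'

# str.count and str.find look at raw substrings, so the "324" inside
# "45324" is matched as well.
substring_hits = text.count('324')
first_offset = text.find('324')
second_offset = text.find('324', first_offset + 1)

# Splitting on whitespace first compares whole tokens instead.
word_hits = text.split().count('324')

print(substring_hits, word_hits)
```

The second offset found by str.find lies inside 45324, which is exactly the unwanted match described above.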

Marco Giusti · Feb 10, 2007

I need to find all occurrences of the same word in a text.
What would be the best way to do that?
I used string.find, but it does not work properly for whole words.
Suppose I want to find the number 324 in the text

'45 324 45324'

There is only one occurrence of the word 324, but string.find() finds 2
occurrences (it also matches inside 45324).

ciao
marco


Johny · Feb 10, 2007

ciao

Marco,
Thank you for your help.
It works perfectly, but I forgot to say that I also need to find the
position of each occurrence of the word. Is that possible?
Thanks.
L

ZeD · Feb 10, 2007

Johny said:
ciao

Marco,
Thank you for your help.
It works perfectly, but I forgot to say that I also need to find the
position of each occurrence of the word. Is that possible?

>>> [i for i, e in enumerate('45 324 45324'.split()) if e == '324']
[1]
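
The expression above returns word indexes (324 is word number 1). If character offsets are needed as well, one possible variation is sketched below. The helper name word_positions is made up for this example, and it assumes words are separated by whitespace only.

```python
import re

def word_positions(text, word):
    """Return (word_index, char_offset) pairs for tokens equal to word.

    Tokens are runs of non-whitespace characters (an assumption).
    """
    hits = []
    for word_index, match in enumerate(re.finditer(r'\S+', text)):
        if match.group() == word:
            hits.append((word_index, match.start()))
    return hits

print(word_positions('45 324 45324', '324'))
```

Each pair holds the index of the word in the split list and the offset of its first character in the original string.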

Marco Giusti · Feb 10, 2007

Marco,
Thank you for your help.
It works perfectly, but I forgot to say that I also need to find the
position of each occurrence of the word. Is that possible?

Play with count and index, and take a look at the help for both.
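
One way to read this hint is sketched below. It assumes that count and index are called on the list produced by split(), which the post does not spell out, and the function name word_indexes is invented for the example.

```python
def word_indexes(text, word):
    """Return the list indexes of every whole-word occurrence of word."""
    words = text.split()
    indexes = []
    start = 0
    # list.count says how many hits to expect; list.index(value, start)
    # finds the next one at or after position start.
    for _ in range(words.count(word)):
        position = words.index(word, start)
        indexes.append(position)
        start = position + 1
    return indexes

print(word_indexes('45 324 45324 324', '324'))
```

Because count is called first, index never raises ValueError inside the loop.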

ciao
marco


Thorsten Kampe · Feb 10, 2007

* Johny (10 Feb 2007 05:29:23 -0800)

I need to find all occurrences of the same word in a text.
What would be the best way to do that?
I used string.find, but it does not work properly for whole words.
Suppose I want to find the number 324 in the text

'45 324 45324'

There is only one occurrence of the word 324, but string.find() finds 2
occurrences (it also matches inside 45324).

Must I use regex?

There are two approaches: one is the "solve once and forget" approach
where you code around this particular problem. Marco showed you one
solution for this.

The other approach would be to realise that your problem is a specific
case of two general problems: partitioning a sequence by a separator
and partitioning a sequence into equivalence classes. The bonus of this
approach is that you will have a /lot/ of problems that can be solved
with either one of these utils or a combination of them.

>>> a = '45 324 45324'
>>> quotient_set(part(a, [' ', ' '], 'sep'), ident)
{'324': ['324'], '45': ['45'], '45324': ['45324']}

The latter approach is much more flexible. Just imagine your problem
changes to a string that's separated by newlines (instead of spaces)
and you want to find words that start with the same character (instead
of words that are identical).
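
The part, quotient_set and ident utilities used above are not shown in the thread. The sketch below is only a rough stand-in for the two general ideas (splitting by separators, then grouping into equivalence classes). The names split_on and equivalence_classes are hypothetical and are not the poster's actual code.

```python
import re
from collections import defaultdict

def split_on(text, separators):
    """Split text on any of the given separator strings, dropping empty pieces."""
    pattern = '|'.join(re.escape(sep) for sep in separators)
    return [piece for piece in re.split(pattern, text) if piece]

def equivalence_classes(items, key):
    """Group items into lists whose members share the same key(item)."""
    classes = defaultdict(list)
    for item in items:
        classes[key(item)].append(item)
    return dict(classes)

# Original problem: space-separated, words equivalent when identical.
words = split_on('45 324 45324', [' '])
same_word = equivalence_classes(words, lambda w: w)

# Changed problem: newline-separated, equivalent when the first character matches.
lines = split_on('apple\navocado\nbanana', ['\n'])
same_initial = equivalence_classes(lines, lambda w: w[0])

print(same_word)
print(same_initial)
```

Only the separator list and the key function change between the two problems, which is presumably the flexibility being described.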

Thorsten

Samuel Karl Peterson · Feb 11, 2007

I need to find all occurrences of the same word in a text.
What would be the best way to do that?

I make no claims of this being the best approach:

====================
import re

def findOccurances(a_string, word):
    """
    Given a string and a word, returns a tuple:
    [0] = count [1] = list of indexes where word occurs
    """
    count = 0
    indexes = []
    start = 0  # offset for successive passes
    # re.escape keeps any regex metacharacters in word literal
    pattern = re.compile(r'\b%s\b' % re.escape(word), re.I)

    while True:
        match = pattern.search(a_string)
        if not match:
            break
        count += 1
        indexes.append(match.start() + start)
        start += match.end()
        a_string = a_string[match.end():]

    return (count, indexes)
====================

Seems to work for me. No guarantees.

Neil Cerutti · Feb 11, 2007

I need to find all occurrences of the same word in a text.
What would be the best way to do that?
I used string.find, but it does not work properly for whole words.
Suppose I want to find the number 324 in the text

'45 324 45324'

There is only one occurrence of the word 324, but string.find() finds 2
occurrences (it also matches inside 45324).

Must I use regex?
Thanks for help

The first thing to do is to answer the question: What is a word?

The second thing to do is to design some code that can find
words in strings.

The last thing to do is to search those actual words for the word
you're looking for.
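
The three steps could be sketched roughly as follows. The definition of a word used here (a run of letters, digits or underscores) is only an assumption, and the function names are invented for the example.

```python
import re

# Step 1: decide what a word is. Assumed here: a run of letters,
# digits or underscores. Change the pattern to change the definition.
WORD_PATTERN = re.compile(r'\w+')

def find_words(text):
    """Step 2: return every word in text together with its starting offset."""
    return [(m.group(), m.start()) for m in WORD_PATTERN.finditer(text)]

def find_word(text, target):
    """Step 3: keep only the offsets of the words equal to target."""
    return [offset for word, offset in find_words(text) if word == target]

print(find_word('45 324, 45324 and 324.', '324'))
```

Keeping the word definition in one place means punctuation handling can be changed without touching the search step.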

Maël Benjamin Mettler · Feb 11, 2007

In order to find all the words in a text, you need to tokenize it first.
The rest is a matter of calling the count method on the list of
tokenized words. For tokenization look here:
http://nltk.sourceforge.net/lite/doc/en/words.html
A word of warning: depending on what exactly you need to do, the
seemingly trivial task of tokenizing a text can become quite complex.
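
As a rough illustration of "tokenize, then count", the sketch below uses a deliberately naive regex tokenizer from the standard library instead of NLTK. The tokenize function is a stand-in for this example only and ignores the complications mentioned above (hyphens, apostrophes, abbreviations and so on).

```python
import re
from collections import Counter

def tokenize(text):
    """Very rough tokenizer: lower-cased runs of word characters."""
    return re.findall(r'\w+', text.lower())

tokens = tokenize('The cat saw the other cat.')

# list.count answers the question for a single word ...
cat_count = tokens.count('cat')

# ... and Counter tallies every word in one pass.
frequencies = Counter(tokens)

print(cat_count, frequencies['the'])
```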

Enjoy,

Maël

attn.steven.kuo · Feb 11, 2007

I need to find all occurrences of the same word in a text.
What would be the best way to do that?

I make no claims of this being the best approach:

====================
import re

def findOccurances(a_string, word):
    """
    Given a string and a word, returns a tuple:
    [0] = count [1] = list of indexes where word occurs
    """
    count = 0
    indexes = []
    start = 0  # offset for successive passes
    # re.escape keeps any regex metacharacters in word literal
    pattern = re.compile(r'\b%s\b' % re.escape(word), re.I)

    while True:
        match = pattern.search(a_string)
        if not match:
            break
        count += 1
        indexes.append(match.start() + start)
        start += match.end()
        a_string = a_string[match.end():]

    return (count, indexes)
====================

Seems to work for me. No guarantees.

More concisely:

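import re

target_string = '45 324 45324'  # sample text from the original question
pattern = re.compile(r'\b324\b')
# finditer yields one match object per whole-word occurrence
indices = [match.start() for match in pattern.finditer(target_string)]
print("Indices:", indices)
print("Count:", len(indices))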

Samuel Karl Peterson · Feb 12, 2007

(e-mail address removed) on 11 Feb 2007 08:16:11 -0800 didst step
forth and proclaim thus:

More concisely:

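import re

target_string = '45 324 45324'  # sample text from the original question
pattern = re.compile(r'\b324\b')
# finditer yields one match object per whole-word occurrence
indices = [match.start() for match in pattern.finditer(target_string)]
print("Indices:", indices)
print("Count:", len(indices))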

Thank you, this is educational. I didn't realize that finditer
returned match objects instead of tuples.
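
For comparison, here is a small sketch of what the match objects from finditer expose, next to findall, which returns plain strings for a pattern without groups. The sample string is made up for the example.

```python
import re

pattern = re.compile(r'\b324\b')
sample = '45 324 45324 324'

# finditer yields match objects, which carry positions as well as text.
matches = list(pattern.finditer(sample))
spans = [m.span() for m in matches]    # (start, end) offsets
found = [m.group() for m in matches]   # the matched text

# findall returns only the matched strings, without positions.
plain = pattern.findall(sample)

print(spans, found, plain)
```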
