Regular Expressions: large number of or's

  • Thread starter André Søreng

André Søreng

Hi!

Given a string, I want to find all occurrences of
certain predefined words in that string. Problem is, the list of
words that should be detected can be on the order of thousands.

With the re module, this can be solved something like this:

import re

r = re.compile("word1|word2|word3|.......|wordN")
r.findall(some_string)

Unfortunately, with more than about 10,000 words in
the regexp, I get a regular expression runtime error when
trying to execute the findall function (compile works fine, albeit slowly).

I don't know if using the re module is the right solution here; any
suggestions on alternative solutions or data structures that could
be used to solve the problem?

André
 

Tim Peters

[André Søreng]
Given a string, I want to find all occurrences of
certain predefined words in that string. Problem is, the list of
words that should be detected can be on the order of thousands.

With the re module, this can be solved something like this:

import re

r = re.compile("word1|word2|word3|.......|wordN")
r.findall(some_string)

Unfortunately, with more than about 10,000 words in
the regexp, I get a regular expression runtime error when
trying to execute the findall function (compile works fine, albeit slowly).

I don't know if using the re module is the right solution here; any
suggestions on alternative solutions or data structures that could
be used to solve the problem?

Put the words you're looking for into a set (or as the keys of a dict
in older Pythons; the values in the dict are irrelevant).

I don't know what you mean by "word", so write something that breaks
your string into what you mean by words. Then:

for word in something_that_produces_words(the_string):
    if word in set_of_words:
        pass  # found one

This takes expected-case time proportional to the number of words in
the string, plus setup time proportional to the number of "interesting"
words (the time needed to create a set or dict from them). 10,000
interesting words won't even start to strain it.
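
For concreteness, a minimal runnable sketch of that approach, assuming
"word" means a maximal run of \w characters (the tokenizer and sample
values below are only illustrative):

import re

set_of_words = set(["word1", "word2", "word3"])   # the interesting words

def something_that_produces_words(s):
    # one plausible definition of "word": maximal runs of \w characters
    return re.findall(r"\w+", s)

the_string = "word1 appears here, and so does word3"
found = [w for w in something_that_produces_words(the_string)
         if w in set_of_words]
# found == ['word1', 'word3']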
 

Kent Johnson

André Søreng said:
Hi!

Given a string, I want to find all occurrences of
certain predefined words in that string. Problem is, the list of
words that should be detected can be on the order of thousands.

With the re module, this can be solved something like this:

import re

r = re.compile("word1|word2|word3|.......|wordN")
r.findall(some_string)

Unfortunately, with more than about 10,000 words in
the regexp, I get a regular expression runtime error when
trying to execute the findall function (compile works fine, albeit slowly).

I don't know if using the re module is the right solution here; any
suggestions on alternative solutions or data structures that could
be used to solve the problem?

If you can split some_string into individual words, you could look them up in a set of known words:

known_words = set("word1 word2 word3 ....... wordN".split())
found_words = [ word for word in some_string.split() if word in known_words ]

Kent
 

James Stroud

This does not sound like a job for a single regex.

Using a list and a listcomp (say your words are in a list called "mywordlist"),
you can make this quite terse. Of course, I have a way of writing algorithms
that turn out to have a very large exp when people tell me the O(N^exp).

Try this:


import re

myregexlist = [re.compile(aword) for aword in mywordlist]
myoccurrences = [argx.findall(some_string) for argx in myregexlist]


Now you should have a 1:1 mapping between mywordlist and myoccurrences. Of
course, you can fill mywordlist with real regular expressions instead of just
words. If you only want to count the words, you may just want to use the string
count method:


myoccurrences = [some_string.count(aword) for aword in mywordlist]


This may make more sense if you are not using true regexes.

James
 

André Søreng

Kent said:
André Søreng said:
Hi!

Given a string, I want to find all occurrences of
certain predefined words in that string. Problem is, the list of
words that should be detected can be on the order of thousands.

With the re module, this can be solved something like this:

import re

r = re.compile("word1|word2|word3|.......|wordN")
r.findall(some_string)

Unfortunately, with more than about 10,000 words in
the regexp, I get a regular expression runtime error when
trying to execute the findall function (compile works fine, albeit slowly).

I don't know if using the re module is the right solution here; any
suggestions on alternative solutions or data structures that could
be used to solve the problem?


If you can split some_string into individual words, you could look them
up in a set of known words:

known_words = set("word1 word2 word3 ....... wordN".split())
found_words = [ word for word in some_string.split() if word in known_words ]

Kent

That is not exactly what I want. It should discover whether any of
the predefined words appear as substrings, not only as whole
words. For instance, when matching against "word2sgjoisejfisaword1yguyg", word2
and word1 should both be detected.
 

Francis Girard

Le mardi 1 Mars 2005 22:04, André Søreng a écrit :
That is not exactly what I want. It should discover whether any of
the predefined words appear as substrings, not only as whole
words. For instance, when matching against "word2sgjoisejfisaword1yguyg", word2
and word1 should both be detected.

Hi,

A lexer producing a DFA like the one in pyggy (see
http://www.lava.net/~newsham/pyggy/) might be what you're looking for.

Regards,

Francis Girard
 

Bill Mill

Kent said:
André Søreng said:
Hi!

Given a string, I want to find all occurrences of
certain predefined words in that string. Problem is, the list of
words that should be detected can be on the order of thousands.

With the re module, this can be solved something like this:

import re

r = re.compile("word1|word2|word3|.......|wordN")
r.findall(some_string)

Unfortunately, with more than about 10,000 words in
the regexp, I get a regular expression runtime error when
trying to execute the findall function (compile works fine, albeit slowly).

I don't know if using the re module is the right solution here; any
suggestions on alternative solutions or data structures that could
be used to solve the problem?


If you can split some_string into individual words, you could look them
up in a set of known words:

known_words = set("word1 word2 word3 ....... wordN".split())
found_words = [ word for word in some_string.split() if word in known_words ]

Kent

That is not exactly what I want. It should discover whether any of
the predefined words appear as substrings, not only as whole
words. For instance, when matching against "word2sgjoisejfisaword1yguyg", word2
and word1 should both be detected.

Show some initiative, man!
>>> known_words = set(["word1", "word2"])
>>> found_words = [word for word in known_words
...                if word in "word2sgjoisejfisaword1yguyg"]
>>> found_words
['word1', 'word2']

Peace
Bill Mill
bill.mill at gmail.com
 

Kent Johnson

André Søreng said:
Hi!

Given a string, I want to find all occurrences of
certain predefined words in that string. Problem is, the list of
words that should be detected can be on the order of thousands.

With the re module, this can be solved something like this:

import re

r = re.compile("word1|word2|word3|.......|wordN")
r.findall(some_string)

Unfortunately, with more than about 10,000 words in
the regexp, I get a regular expression runtime error when
trying to execute the findall function (compile works fine, albeit slowly).

What error do you get? What version of Python are you using? re was changed in Python 2.4 to avoid
recursion, so if you are getting a stack overflow in Python 2.3 you should try 2.4.

Kent
 

Nick Craig-Wood

André Søreng said:
Given a string, I want to find all occurrences of
certain predefined words in that string. Problem is, the list of
words that should be detected can be on the order of thousands.

With the re module, this can be solved something like this:

import re

r = re.compile("word1|word2|word3|.......|wordN")
r.findall(some_string)

Unfortunately, with more than about 10,000 words in
the regexp, I get a regular expression runtime error when
trying to execute the findall function (compile works fine, albeit
slowly).

I wrote a regexp optimiser for exactly this case.

E.g., a regexp for all 5-letter words starting with "re":

$ grep -c '^re' /usr/share/dict/words
2727

$ grep '^re' /usr/share/dict/words | ./words-to-regexp.pl 5

re|re's|reac[ht]|rea(?:d|d[sy]|l|lm|m|ms|p|ps|r|r[ms])|reb(?:el|u[st])|rec(?:ap|ta|ur)|red|red's|red(?:id|o|s)|ree(?:d|ds|dy|f|fs|k|ks|l|ls|ve)|ref|ref's|refe[dr]|ref(?:it|s)|re(?:gal|hab|(?:ig|i)n|ins|lax|lay|lic|ly|mit|nal|nd|nds|new|nt|nts|p)|rep's|rep(?:ay|el|ly|s)|rer(?:an|un)|res(?:et|in|t|ts)|ret(?:ch|ry)|re(?:use|v)|rev's|rev(?:el|s|ue)

As you can see, it's not perfect.

Find it at http://www.craig-wood.com/nick/pub/words-to-regexp.pl

Yes, it's Perl and rather kludgy, but it may give you ideas!
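
For the curious, a rough Python sketch of the same idea (a hypothetical
reimplementation, not a port of the Perl script above: build a trie of
the words, then emit a regexp with common prefixes factored out):

import re

def trie_to_regexp(words):
    # Build a character trie; the None key marks end-of-word.
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = {}

    def emit(node):
        end = None in node                      # a word ends at this node
        branches = sorted(re.escape(ch) + emit(child)
                          for ch, child in node.items() if ch is not None)
        if not branches:
            return ""
        if len(branches) == 1 and not end:
            return branches[0]
        body = "(?:%s)" % "|".join(branches)
        return body + "?" if end else body

    return emit(trie)

# trie_to_regexp(["red", "reds", "ree"]) == 're(?:d(?:s)?|e)'

Because every alternative inside a group starts with a distinct
character, the engine never has to try more than one branch at each
position, which is what keeps the factored form fast.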
 

Daniel Yoo

:> Given a string, I want to find all occurrences of
:> certain predefined words in that string. Problem is, the list of
:> words that should be detected can be on the order of thousands.
:>
:> With the re module, this can be solved something like this:
:>
:> import re
:>
:> r = re.compile("word1|word2|word3|.......|wordN")
:> r.findall(some_string)

The internal data structure that encodes that set of keywords is
probably humongous. An alternative approach to this problem is to
tokenize your string into words, and then check to see if each word is
in a defined list of "keywords". This works if your keywords are
single words:

###
import re

keywords = set(["word1", "word2", ...])
# tokenize the string into words, then intersect with the keyword set
matchingWords = set(re.findall(r'\w+', some_string)).intersection(keywords)
###

Would this approach work for you?



Otherwise, you may want to look at a specialized data structure for
doing multiple keyword matching; I had an older module that wrapped
around a suffix tree:

http://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/

It looks like other folks, thankfully, have written other
implementations of suffix trees:

http://cs.haifa.ac.il/~shlomo/suffix_tree/

Another approach is something called the Aho-Corasick algorithm:

http://portal.acm.org/citation.cfm?doid=360825.360855

though I haven't been able to find a nice Python module for this yet.
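
In the meantime, a minimal pure-Python sketch may show the idea (all
names here are hypothetical; this is not the module announced later in
the thread). It builds a keyword trie with failure links, then scans
the text once, so the work is proportional to the length of the text
plus the number of matches:

from collections import deque

def build_aho_corasick(keywords):
    # goto[state] maps a character to the next state; fail[state] is the
    # failure link; output[state] lists the keywords ending at that state.
    goto, fail, output = [{}], [0], [[]]
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)

    # Breadth-first pass to fill in failure links and merge outputs.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            output[nxt] += output[fail[nxt]]
    return goto, fail, output

def search(text, goto, fail, output):
    # One pass over the text, yielding (start_position, keyword) pairs.
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            yield i - len(word) + 1, word

# For example, on the substring case from earlier in the thread:
#   tables = build_aho_corasick(["word1", "word2"])
#   list(search("word2sgjoisejfisaword1yguyg", *tables))
#   -> [(0, 'word2'), (17, 'word1')]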


Best of wishes to you!
 

André Søreng

Bill said:
Kent said:
André Søreng wrote:


Hi!

Given a string, I want to find all occurrences of
certain predefined words in that string. Problem is, the list of
words that should be detected can be on the order of thousands.

With the re module, this can be solved something like this:

import re

r = re.compile("word1|word2|word3|.......|wordN")
r.findall(some_string)

Unfortunately, with more than about 10,000 words in
the regexp, I get a regular expression runtime error when
trying to execute the findall function (compile works fine, albeit slowly).

I don't know if using the re module is the right solution here; any
suggestions on alternative solutions or data structures that could
be used to solve the problem?


If you can split some_string into individual words, you could look them
up in a set of known words:

known_words = set("word1 word2 word3 ....... wordN".split())
found_words = [ word for word in some_string.split() if word in known_words ]

Kent


André

That is not exactly what I want. It should discover whether any of
the predefined words appear as substrings, not only as whole
words. For instance, when matching against "word2sgjoisejfisaword1yguyg", word2
and word1 should both be detected.


Show some initiative, man!

>>> known_words = set(["word1", "word2"])
>>> found_words = [word for word in known_words
...                if word in "word2sgjoisejfisaword1yguyg"]
>>> found_words

['word1', 'word2']

Peace
Bill Mill
bill.mill at gmail.com

Yes, but I was looking for a solution that would scale. Searching
through the same string 10,000+ times does not seem like a suitable
solution.

André
 

André Søreng

Daniel said:
:> Given a string, I want to find all occurrences of
:> certain predefined words in that string. Problem is, the list of
:> words that should be detected can be on the order of thousands.
:>
:> With the re module, this can be solved something like this:
:>
:> import re
:>
:> r = re.compile("word1|word2|word3|.......|wordN")
:> r.findall(some_string)

The internal data structure that encodes that set of keywords is
probably humongous. An alternative approach to this problem is to
tokenize your string into words, and then check to see if each word is
in a defined list of "keywords". This works if your keywords are
single words:

###
import re

keywords = set(["word1", "word2", ...])
# tokenize the string into words, then intersect with the keyword set
matchingWords = set(re.findall(r'\w+', some_string)).intersection(keywords)
###

Would this approach work for you?



Otherwise, you may want to look at a specialized data structure for
doing multiple keyword matching; I had an older module that wrapped
around a suffix tree:

http://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/

It looks like other folks, thankfully, have written other
implementations of suffix trees:

http://cs.haifa.ac.il/~shlomo/suffix_tree/

Another approach is something called the Aho-Corasick algorithm:

http://portal.acm.org/citation.cfm?doid=360825.360855

though I haven't been able to find a nice Python module for this yet.


Best of wishes to you!

Thanks, it seems the Aho-Corasick algorithm is along the lines of
what I was looking for, but I have not read the article completely yet.

Also,
http://alexandria.tue.nl/extra1/wskrap/publichtml/200407.pdf
provides several alternative algorithms.

André
 

Ola Natvig

André Søreng said:
Yes, but I was looking for a solution that would scale. Searching
through the same string 10,000+ times does not seem like a suitable
solution.

André

Just out of curiosity, what would a regexp engine actually do here? Perhaps
the way regexps are executed holds a clue to how you could do this.

ola
 

Manlio Perillo

[André Søreng]
Given a string, I want to find all occurrences of
certain predefined words in that string. Problem is, the list of
words that should be detected can be on the order of thousands.

With the re module, this can be solved something like this:

import re

r = re.compile("word1|word2|word3|.......|wordN")
r.findall(some_string)

Unfortunately, with more than about 10,000 words in
the regexp, I get a regular expression runtime error when
trying to execute the findall function (compile works fine, albeit slowly).

I don't know if using the re module is the right solution here; any
suggestions on alternative solutions or data structures that could
be used to solve the problem?

Put the words you're looking for into a set (or as the keys of a dict
in older Pythons; the values in the dict are irrelevant).

I don't know what you mean by "word", so write something that breaks
your string into what you mean by words. Then:

for word in something_that_produces_words(the_string):
    if word in set_of_words:
        pass  # found one


I have the same problem.
Unfortunately, the meaning of a "word" depends on the word.
As an example, I would like to count the number of occurrences of
movie titles in some text.

Maybe lex is more optimized?
Unfortunately, it seems that there are no lex versions that generate
Python (or Pyrex) code.
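
For titles specifically, a crude counting sketch along the lines of
James Stroud's string count suggestion earlier in the thread (the
titles and variable names below are illustrative; this is
O(len(text) * number of titles), so for thousands of titles the
Aho-Corasick approach mentioned above would scale better):

titles = ["blade runner", "alien"]    # illustrative movie titles
text = some_text.lower()              # some_text: the text to scan (assumed defined)
counts = dict((title, text.count(title)) for title in titles)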



Thanks and regards Manlio Perillo
 

Manlio Perillo

Hi.

Python allows subclassing builtin classes, but the Python interpreter
uses the builtin types internally.
As an example, keyword arguments are collected in a dict, but I would
like to use a user-defined SortedDict.

Are there plans to add such a feature in a future version?


Thanks and regards Manlio Perillo
 

Daniel Yoo

: Otherwise, you may want to look at a specialized data structure for
: doing multiple keyword matching; I had an older module that wrapped
: around a suffix tree:

: http://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees/

: It looks like other folks, thankfully, have written other
: implementations of suffix trees:

: http://cs.haifa.ac.il/~shlomo/suffix_tree/

: Another approach is something called the Aho-Corasick algorithm:

: http://portal.acm.org/citation.cfm?doid=360825.360855

: though I haven't been able to find a nice Python module for this yet.


Followup on this: I haven't been able to find one, so I took someone
else's implementation and adapted it. *grin*

Here you go:

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

This provides an 'ahocorasick' Python C extension module for doing
matching on a set of keywords. I'll start writing out the package
announcements tomorrow.


I hope this helps!
 

John Machin

Daniel said:
Here you go:

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

This provides an 'ahocorasick' Python C extension module for doing
matching on a set of keywords. I'll start writing out the package
announcements tomorrow.

Looks good.

However:

tree.search("I went to alpha beta the other day to pick up some spam")

could use a startpos (default=0) argument, for efficiently restarting
the search after finding the first match.
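
To see why: without a startpos, restarting means slicing the text,
which copies the remainder each time. A sketch of the workaround (the
(start, word) return convention assumed here is a guess, not the
module's documented interface):

def find_all(tree, text):
    # Restart just past each hit; text[pos:] copies the tail every
    # time, which is exactly what a startpos argument would avoid.
    pos = 0
    while True:
        hit = tree.search(text[pos:])   # assumed: (start, word) or None
        if hit is None:
            return
        start, word = hit
        yield pos + start, word
        pos += start + 1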
 
