need an idea, recognize sequence, fsm or genetic ?

J

Joh

hello,

here is my trouble, i would to like to write a program which could
help me to detect sequence of consecutive words in list in a very
efficient way. (i need to do it upon large amount of text, and for now
i'm looking for a good start point)

i have many "sentences" of variables sizes, for example one could be :
sentence1 = ['w43', 'w12', 'w41', 'w85', 'w4', 'w74', 'w24', 'w1',
'w6', 'w41', 'w85', 'w7', 'w23', 'w43']
where wx can be seen as a word.

then i had sequence of words which i need to recognize, for example :
sequences = [['w41', 'w85', 'w4'], ['w41', 'w85', 'w7'],] etc.

and need something like span position where sequences are present in
sentence:

{
['w41', 'w85', 'w4']: [[2, 4], ],
['w41', 'w85', 'w7']: [[9, 11], ]
}

for now i'm doing it by using many nested 'for loop', and it is too
slow, as i search for each sequence one by one, and then restart from
beginning, and so on, i'm wondering if this can not be done by using
nested dictionnary or something like that, maybe is there an
improvement to imagine ?
(i'm wondering if i should not find a way to not re-start from
beginning when i had found ['w41', 'w85', ] for the second time and
thus just only look after last token 'w4' or 'w7')

also i think this is related to some kind of finite state automata,
but i do not know how to build automatically massive parallel FSM (i
can have many many sequences and huge amount of sentences)

i was also wondering if some genetics algorithm used to align sequence
of protein could not help me, i had once read something about that ,
but still can not where to start. also as far as i remember such
algorithms take into account that 'sequence' could not be consecutive,
what i need is consecutive.

maybe one of you could help ? give a start point or even an idea that
could help me to design this piece of code ?
 
H

Harald Massa

joh12005,

your problem sounds as it may be solvable with mxtexttools.

they claim to be very fast in analyzing text and splitting into python
tuples. I never fully understood how to deal with them, sorry.

so just google for it and try it.

David Merz wrote "text processing in python" and dealt with this tools.

Sun,

Harald
 
M

Martin DeMello

Joh said:
also i think this is related to some kind of finite state automata,
but i do not know how to build automatically massive parallel FSM (i
can have many many sequences and huge amount of sentences)

here's a naive but clean way to do it

Write a finite state machine class that scans the sentence for a single
sequence. Then instantiate one of them per sequence, and run the
following loop (pseudocode)

for word, index in enumerate sentence
for fsm in sequences
fsm.advance(word)
if fsm.state == END
results[fsm] += (index - fsm.sequence_length, index)
fsm.state == START

A possible optimisation is to combine all your sequences into a single
trie; however, this will also require backtracking so I'm not sure how
much faster it'd be in practice.

martin
 
M

Mike Maxwell

Joh said:
here is my trouble, i would to like to write a program which could
help me to detect sequence of consecutive words in list in a very
efficient way.

These repeats are called "tandem arrays." Algorithms for quickly
finding them (and many other useful patterns) are given in

Gusfield, Dan. 1997. Algorithms on strings,
trees and sequences. Cambridge: Cambridge
University Press.

Gusfield has a web site
http://wwwcsif.cs.ucdavis.edu/~gusfield/
with downloadable software. I don't recall whether his web site talks
about tandem arrays, but I'm sure you could google the term...

Mike Maxwell
 
M

Mike Maxwell

Mike said:
These repeats are called "tandem arrays." Algorithms for quickly
finding them (and many other useful patterns) are given in

Apologies, I just re-read your original post, and (if I understand what
you are looking for), it is not tandem arrays. Nevertheless, Gusfield's
book is on a whole host of text search problems, and may still give you
leads (or even answers).

Mike Maxwell
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top