Sequence and/or pattern matching

  • Thread starter =?iso-8859-1?B?U+li?=
  • Start date
?

=?iso-8859-1?B?U+li?=

Hi everyone,

I'm relatively new to python and I want to write a piece of code who do
the following work for data mining purpose :

1) I have a list of connexion between some computers. This list has
this format :

Ip A Date Ip B
.... ... ...
192.168.0.1 19.10.2005 192.168.0.2
192.168.0.3 19.10.2005 192.168.0.1
192.168.0.4 19.10.2005 192.168.0.6
.... ... ...

2) I want to find if there are unknown sequences of connexions in my
data and if these sequences are repeated along the file :

For example :

Computer A connects to Computer B then
Computer B connects to Computer C then
Computer C connects to Computer A

3) Then, the software gives the sequences it has found and how many
times they appear...

I hope this is clear, point 2) is where I have my main problem. Has
someone an idea where to start and if there's already something coded ?

Thanks

Séb
 
A

Alex Martelli

Séb said:
Hi everyone,

I'm relatively new to python and I want to write a piece of code who do
the following work for data mining purpose :

Essentially, if I understand correctly, you want to detect LOOPS given a
sequence of directed connections A->B. "loop detection" and "graph"
would then be the keywords to search for, in this case.
2) I want to find if there are unknown sequences of connexions in my
data and if these sequences are repeated along the file :

For example :

Computer A connects to Computer B then
Computer B connects to Computer C then
Computer C connects to Computer A

Does this "then" imply you're only interested in loops occurring in this
*sequence*, i.e., is order of connections important? If the sequence of
directed connections was, say, in the different order:

B->C
A->B
C->A

would you want this detected as a loop, or not?


Alex
 
?

=?iso-8859-1?B?U+li?=

Essentially, if I understand correctly, you want to detect LOOPS given a
sequence of directed connections A->B. "loop detection" and "graph"
would then be the keywords to search for, in this case.

Exactly, but the sequence has to be discovered by the piece of code !
Does this "then" imply you're only interested in loops occurring in this
*sequence*, i.e., is order of connections important? If the sequence of
directed connections was, say, in the different order:

B->C
A->B
C->A

would you want this detected as a loop, or not?

Yes, it would be nice to detect it as a loop, with for example a
threshold. Btw, it would be nice to ignore additional connections in
such a way :

B->C # Normal connection
D->E # Additional connection to ignore
A->B # Normal connection
C->A # Normal connection

Would it be possible ?

Thank you very much
 
B

Ben Sizer

Séb said:
1) I have a list of connexion between some computers. This list has
this format :

It looks like you want graph theory.
Ip A Date Ip B
... ... ...
192.168.0.1 19.10.2005 192.168.0.2
192.168.0.3 19.10.2005 192.168.0.1
192.168.0.4 19.10.2005 192.168.0.6

That looks like a list of edges between graph nodes, you see. Each node
corresponds to an address and each edge is a connection between two
nodes - ip addresses, in your case.
2) I want to find if there are unknown sequences of connexions in my
data and if these sequences are repeated along the file :

For example :

Computer A connects to Computer B then
Computer B connects to Computer C then
Computer C connects to Computer A

That looks like finding a path between Node A and Node C. This is a
common application of graph theory, and especially when finding routes
(eg. for train journeys, or for AI characters in computer games).
3) Then, the software gives the sequences it has found and how many
times they appear...

You can find all the routes between 1 node and others by using
depth-first search (or indeed, any other simple graph search
algorithm). Basically, this says that, for any node, examine all the
nodes it leads to. Then examine those nodes... repeat until you run out
of nodes or find where you're looking for. The only complication is
remembering the route you took.
I hope this is clear, point 2) is where I have my main problem. Has
someone an idea where to start and if there's already something coded ?

I found a couple of Python graph libraries on Google but I can't vouch
for their quality. However, writing your own simple graph traversal
algorithm should be quite trivial. I would start by populating a
dictionary where the keys are instances of address A and the values are
lists of address B values for address A. Add each link again with
address B as the key and address A as the value if you need to
represent bidirectional connections. Then you can perform search on
that as required. The only slight complication is avoiding infinite
loops - I might use a dictionary of address->boolean values here and
check off each address as the algorithm progresses.
 
M

Mike Meyer

Séb said:
Exactly, but the sequence has to be discovered by the piece of code !


Yes, it would be nice to detect it as a loop, with for example a
threshold. Btw, it would be nice to ignore additional connections in
such a way :

B->C # Normal connection
D->E # Additional connection to ignore
A->B # Normal connection
C->A # Normal connection

Would it be possible ?

As Ben Sizer pointed out, this is a fairly well-known graph
algorithm. You just want to add information to add ordering
information to each edge, so you can verify that the an edge that is
further down the path is "older" than the last edge added. Given your
ordering requirement, simply numbering the edges and checking that the
"older" edge has a larger number than the previous edge should do.

With my admin hat on, I have to wonder - don't you really want to deal
with duration? I.e - each connection has a "start" and "end" time, and
you're really only interested in paths that happen while all the
connections actually exist? Just a wild ass guess, mind you. However,
it doesn't radically change the test. You now tag each edge with a
start and stop time, and check that all previous edges on the path
exist while the one you are adding exists.

<mike
 
?

=?iso-8859-1?B?U+li?=

Hi everybody,

Thanks for the time taken to answer my question. Unfortunatly, it seems
that there's a little confusion about what I want to do.

In fact, I don't want to search for a particular path between
computers. What I really want is to detect sequences of connection that
are repeated along the log. Is it clearer, if not, I will put another
exmample ;-)

Thank you !

Ps Python community is very nice, I'm glad I learn this language !
 
M

Mike Meyer

Séb said:
Hi everybody,

Thanks for the time taken to answer my question. Unfortunatly, it seems
that there's a little confusion about what I want to do.

In fact, I don't want to search for a particular path between
computers. What I really want is to detect sequences of connection that
are repeated along the log. Is it clearer, if not, I will put another
exmample ;-)

You're right - I was confused. That's why I asked about the timing
issue. But I also indicated that I might be way off base with that.

Everyone may be confused as well - we thought you were looking for
*any* series of connections that started at one computer and ended at
a second computer. That problem is the well-understood problem of
finding a path through a graph, where "path" is used in a
graph-theoretic sense and not a network sense.

From what you say above, it looks like you're looking for a specfic
sequence of connections in your connetions record. That's a much
easier problem. Roughly, you'd do:

def matches(desired, history):
"""Determines if history contains the desired path.
desired = # List of connections we're looking for.
history = # List of connections that were actually made."""

for con in desired:
try:
while not con.matches(history[hp]):
hp += 1
except IndexError:
return False
return True


<mike
 
?

=?iso-8859-1?B?U+li?=

Sorry for the confusion, I think my example was unclear. Thank you Mike
for this piece of code who solves a part of my problem. In fact, the
sequences are unknown at the beginning, so the first part of the code
has to find possible sequences and if those sequences are repeated,
counts how many time they appear (as your code does).

I have found this morning that there's a software produced by i2
software who does this kind of job, but for telephone call analysis.
Maybe the description could help to better understand my goal :

http://www.i2.co.uk/products/Pattern_Tracer/default.asp

Séb
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,048
Latest member
verona

Latest Threads

Top