Graph Data Structures


Nathan Harmston

Hi All,

I am working on a generic graph library so I can do graph-based
analysis for various projects I have ideas for. At the moment
I am implementing Graph as a wrapper around a dictionary. My
implementation currently works like this:

t = Graph()
n1 = Node("Node1")
n2 = Node("Test2")
edge1 = Edge("Test3")
t += n1            # graph is now {n1: {}}
t[n1][n2] = edge1  # graph is now {n1: {n2: edge1}}

However, this isn't actually ending up with the structure I want. I want
it to finally end up as {n1: {n2: edge1}, n2: {}}. Is
there any way I can do this simply?

Also, I am looking at having a large graph and was wondering if anyone
knows of any way I could reduce the memory requirements of this
structure and improve the speed of queries on it. I'm thinking of writing
a C extension for it. Is this a good idea, and where would I start?
Or does Python have some kind of transparent memory access module I
can use?

Many Thanks in advance,

Nathan

PS: Please find my code below:

class Graph(object):
    def __init__(self, g=None):
        # A mutable default argument ({}) would be shared by every Graph.
        self.graph = {} if g is None else g

    def __iadd__(self, p):
        # "t += node" registers a node with no outgoing edges.
        if p not in self.graph:
            self.graph[p] = PathsDict()
        return self

    def __getitem__(self, p):
        try:
            return self.graph[p]
        except KeyError:
            raise KeyError("%s not in graph" % repr(p))

    def __str__(self):
        return str(self.graph)

    def filter(self, filter):
        pass


class PathsDict(object):
    def __init__(self):
        # Named _paths so the attribute does not shadow the paths() method.
        self._paths = {}

    def __setitem__(self, p, val):
        # Only the first edge stored for a given target is kept.
        if p not in self._paths:
            self._paths[p] = val

    def __getitem__(self, p):
        return self._paths[p]
        # catch exception here

    def paths(self):
        # Iterating a dict directly yields keys only, so use items().
        for k, v in self._paths.items():
            yield (k, v)

    def edges(self):
        return self._paths.values()

    def __str__(self):
        return str(self._paths)

    def __len__(self):
        return len(self._paths)


class Node(object):
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name


class Edge(dict):
    def __init__(self, name, weight=1):
        self["name"] = name
        self["weight"] = weight

    def __str__(self):
        return self["name"]
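
One possible way to end up with {n1: {n2: edge1}, n2: {}} is to give each inner mapping a reference back to the graph that owns it, so that storing an edge also registers the target node. The following is only a minimal sketch under that assumption: the owner parameter and the add_node method are hypothetical names that are not in the code above, and plain strings stand in for the Node and Edge objects.

```python
class PathsDict(object):
    """Outgoing edges of one node, with a back-reference to its graph."""

    def __init__(self, owner):
        self.owner = owner  # hypothetical back-reference to the Graph
        self.paths = {}

    def __setitem__(self, target, edge):
        # Make sure the target exists as a node before storing the edge.
        self.owner.add_node(target)
        self.paths[target] = edge

    def __getitem__(self, target):
        return self.paths[target]

    def __len__(self):
        return len(self.paths)

    def __repr__(self):
        return repr(self.paths)


class Graph(object):
    def __init__(self):
        self.graph = {}

    def add_node(self, node):
        # Register a node with no outgoing edges if it is not known yet.
        if node not in self.graph:
            self.graph[node] = PathsDict(self)

    def __iadd__(self, node):
        self.add_node(node)
        return self

    def __getitem__(self, node):
        return self.graph[node]

    def __repr__(self):
        return repr(self.graph)


t = Graph()
t += "n1"
t["n1"]["n2"] = "edge1"
print(t)  # {'n1': {'n2': 'edge1'}, 'n2': {}}
```

The back-reference keeps the t[n1][n2] = edge1 syntax unchanged. An alternative that avoids it would be an explicit method on the graph that takes both end nodes and the edge.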
 

bearophileHUGS

Szabolcs Nagy:
I haven't read your code, but there are many graph implementations in
Python.
In case you haven't found these yet:
http://wiki.python.org/moin/PythonGraphApi

If you only want to do some analysis, I think you need this one (as it's
pretty complete and simple):
https://networkx.lanl.gov/

I also recommend reading Guido's essay:
http://www.python.org/doc/essays/graphs.html

I can also suggest my own:
http://sourceforge.net/projects/pynetwork/

And boost graph bindings for Python, quite fast:
http://www.osl.iu.edu/~dgregor/bgl-python/

Bye,
bearophile
 

Nathan Harmston

https://networkx.lanl.gov/

This was working for me earlier, and I managed to get everything from
there. It seems a very good package. It seems there's more out
there than I had thought, which unfortunately makes it harder for
me to decide what to use (pynetwork and bgl look useful as well). I'm
going to do some testing on it later and see what happens with it.
Thanks a lot for your help.

Has anyone got an idea how I could split the contents of a node from
its representation (to save memory in my graph)? That is, the nodes
contain the start and end coordinates and id, and the actual
representation contains the string. I was going to have:

class Node(object):
    pass

class Section(Node):
    pass

class Item(object):
    pass

where Section contains a slice of the Item which I'm interested in. I'm
just not sure how I can access the contents of Item without storing
it, if you get what I mean?

Many Thanks in advance

Nathan
 

John Machin

Nathan said:
This was working for me earlier, and I managed to get everything from
there. It seems a very good package. It seems there's more out
there than I had thought, which unfortunately makes it harder for
me to decide what to use (pynetwork and bgl look useful as well). I'm
going to do some testing on it later and see what happens with it.
Thanks a lot for your help.

Has anyone got an idea how I could split the contents of a node from
its representation (to save memory in my graph)? That is, the nodes
contain the start and end coordinates and id, and the actual
representation contains the string. I was going to have:

class Node(object):
    pass

class Section(Node):
    pass

class Item(object):
    pass

where Section contains a slice of the Item which I'm interested in. I'm
just not sure how I can access the contents of Item without storing
it, if you get what I mean?

No. Not at all. "pass" is not very informative. Neither are
"representation" and "the string". Please tell us what you mean by
"slice". What is an "item", if it's not a "node"? Try listing out the
attributes of a node, with a couple of sample values for each, and then
we might get a clue.

What makes you think that you need to save memory?

What makes you think that you could save memory by splitting whatever
it is?

HTH,
John
 

Robin Becker

.......

I got it back just once, but when I clicked again I see this



RuntimeError Python 2.4.4c1: /usr/bin/python
Sat Nov 25 16:21:16 2006

A problem occurred in a Python script. Here is the sequence of function
calls leading up to the error, in the order they occurred.
/build/bdist.linux-x86_64/egg/tracrst/macro.py in
render_macro(self=<tracrst.macro.TracReSTMacro object>, req=<trac.web.api

.......

782 self.__dict__["_parent_pool"] = \
783 parent_pool or libsvn.core.application_pool;
784 if self.__dict__["_parent_pool"]:
self = <libsvn.repos.svn_repos_t; proxy of C svn_repos_t instance>,
self.__dict__ = {'this': <Swig Object of type 'svn_repos_t *'>},
parent_pool = <libsvn.core.apr_pool_t; proxy of C apr_pool_t instance>,
libsvn = <module 'libsvn' from
'/usr/lib/python2.4/site-packages/libsvn/__init__.pyc'>, libsvn.core =
<module 'libsvn.core' from
'/usr/lib/python2.4/site-packages/libsvn/core.pyc'>,
libsvn.core.application_pool = <libsvn.core.apr_pool_t; proxy of C
apr_pool_t instance>

RuntimeError: instance.__dict__ not accessible in restricted mode
args = ('instance.__dict__ not accessible in restricted mode',)


perhaps I'm seeing different apache processes or something
 

Nathan Harmston

Hi,

The idea is that I'm going to use it to build graphs for sequence
alignment (at the moment). I read a discussion on the corebio
(reimplementation of biopython) group about using intervals to
represent sequence slices. The idea is that my graph may contain
millions of alignments, and storing the sequence (the actual ATGC) is
not required.

class Node(object):
    pass

class Interval(Node):
    _id = "gene1"
    _start = 50
    _end = 200
    _strand = 1

class Sequence(object):
    _sequence = "atgtcgtgagagagagttgtgag................."

So one interval on one sequence would align to another interval from
another sequence, but I want changes I make to the interval to be
reflected in the representation later. If I reverse complement it, I
want the interval to store this information, but the Sequence only
shows this later on when I use it by calling repr or str.

Do you get what I mean?
Many Thanks

Nathan
 

John Machin

Nathan said:
Hi,

The idea is that I'm going to use it to build graphs for sequence
alignment (at the moment). I read a discussion on the corebio
(reimplementation of biopython) group about using intervals to
represent sequence slices. The idea is that my graph may contain
millions of alignments, and storing the sequence (the actual ATGC) is
not required.

class Node(object):
    pass

class Interval(Node):
    _id = "gene1"
    _start = 50
    _end = 200
    _strand = 1

What is the point of subclassing Node if it's just a dummy?

class Sequence(object):
    _sequence = "atgtcgtgagagagagttgtgag................."

So one interval on one sequence would align to another interval from
another sequence, but I want changes I make to the interval to be
reflected in the representation later. If I reverse complement it, I
want the interval to store this information, but the Sequence only
shows this later on when I use it by calling repr or str.

Do you get what I mean?

Only vaguely. You use several terms which appear to be from your trade
jargon as they are not understandable when interpreted in either the
context of Python-speak or ordinary English e.g. "sequence",
"alignment", "ATGC", "reverse complement", "interval".

Two options:
(a) communicate understandably
(b) wait till your wontoks are back from holidays.
 

Nathan Harmston

Hi,

It seems that by just going through the problem and writing out a better
explanation for the reply I have figured out a solution, and the
problem isn't as difficult as I thought it would be.

What is a wontok?

Thanks

Nathan

PS --> the start of my reply:

class Interval(object):
    _id = "gene1"
    _start = 50
    _end = 200
    _strand = 1

class Sequence(object):
    _sequence = "atgtcgtgagagagagttgtgag................."

Only vaguely. You use several terms which appear to be from your trade
jargon

Sequence is a string made from a restricted alphabet (A,T,G,C...).
Sequences can be aligned: 1 ATGCTGCAT
2 TAGCTGTTA
-------
2 5

I'm trying to represent this as a graph: Interval(id=1, start=2, end=6,
strand=1) ---edge------Interval(id=2, start=2, end=6, strand=1)

The problem is I was planning on storing the sequences in a dictionary
{id:Seq}, however each dictionary would represent a different source
of sequences. File1, File2....... (
STORE THE SOURCES AS A DICT AND HAVE SOURCE IN INTERVAL AS WELL
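
A minimal sketch of that idea, under some assumptions: each source (for example a file) is a dictionary of {id: sequence}, all sources sit in one outer dictionary, and an interval stores only the source name, the id, the coordinates and the strand. The names sources, seq_id and reverse_complement, and the use of __slots__ to keep each interval small, are illustrative choices and not part of the code posted so far.

```python
# Hypothetical layout: source name -> {sequence id -> sequence text}.
sources = {
    "File1": {1: "ATGCTGCAT"},
    "File2": {2: "TAGCTGTTA"},
}

_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


class Interval(object):
    # __slots__ avoids a per-instance __dict__, which matters when
    # there are millions of intervals.
    __slots__ = ("source", "seq_id", "start", "end", "strand")

    def __init__(self, source, seq_id, start, end, strand=1):
        self.source = source
        self.seq_id = seq_id
        self.start = start
        self.end = end
        self.strand = strand

    def reverse_complement(self):
        # Only the strand flag changes; the stored sequence is untouched.
        return Interval(self.source, self.seq_id,
                        self.start, self.end, -self.strand)

    def __str__(self):
        # The text is looked up and sliced only when it is asked for.
        text = sources[self.source][self.seq_id][self.start:self.end]
        if self.strand == -1:
            text = "".join(_COMPLEMENT[base] for base in reversed(text))
        return text


a = Interval("File1", 1, 2, 6)
b = Interval("File2", 2, 2, 6)
print(str(a))                       # GCTG
print(str(b))                       # GCTG
print(str(a.reverse_complement()))  # CAGC
```

Because an interval never holds a copy of the sequence text, reverse complementing it costs one small object, and the string work is deferred until str() is called.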
 

John Machin

Nathan said:
Hi,

It seems that by just going through the problem and writing out a better
explanation for the reply I have figured out a solution, and the
problem isn't as difficult as I thought it would be.

Often happens.

What is a wontok?

It's Melanesian Pidgin (from the English "one talk") meaning a person
who speaks the same language as you, a member of your clan, ... the
context being that [at least in Papua New Guinea] there are relatively
many languages each with relatively not many speakers :)

Thanks

Nathan

PS --> the start of my reply:

class Interval(object):
    _id = "gene1"
    _start = 50
    _end = 200
    _strand = 1

class Sequence(object):
    _sequence = "atgtcgtgagagagagttgtgag................."


Sequence is a string made from a restricted alphabet (A,T,G,C...).
Sequences can be aligned: 1 ATGCTGCAT
2 TAGCTGTTA

I'm sure they can be, but appearances can be deceptive when you mix
tabs and spaces -- or whatever caused the above 4 lines to be not
vertically aligned but staggered diagonally like a flight of ducks
heading equatorwards for winter.

Sometimes a line of code (e.g. str1[2:6] == str2[2:6]) is worth a
thousand pictures :)
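
Written out in full with the two sequences from the quoted post, that one-liner might look like the sketch below. The variable names follow the one-liner, and reading the aligned region as the slice [2:6] is an assumption based on the interval coordinates given in the post.

```python
# The two example sequences from the quoted post.
str1 = "ATGCTGCAT"
str2 = "TAGCTGTTA"

# Python slices are 0-based and exclude the end index,
# so [2:6] covers positions 2, 3, 4 and 5.
print(str1[2:6])               # GCTG
print(str2[2:6])               # GCTG
print(str1[2:6] == str2[2:6])  # True
```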

I'm trying to represent this as a graph: Interval(id=1, start=2, end=6,
strand=1) ---edge------Interval(id=2, start=2, end=6, strand=1)

The problem is I was planning on storing the sequences in a dictionary
{id:Seq}, however each dictionary would represent a different source
of sequences. File1, File2....... (
STORE THE SOURCES AS A DICT

Mapping what keys to what values?

AND HAVE SOURCE IN INTERVAL AS WELL

So you had a data modelling problem. These are often better solved as a
separate step before you think about implementation details like
dictionaries.

Good luck with your project.

Cheers,
John
 

Pieter Swart

Nathan said:
I am working on a generic graph library so I can do graph-based
analysis for various projects I have ideas for. At the moment
I am implementing Graph as a wrapper around a dictionary. My
implementation currently works like this:

t = Graph()
n1 = Node("Node1")
n2 = Node("Test2")
edge1 = Edge("Test3")
t += n1            # graph is now {n1: {}}
t[n1][n2] = edge1  # graph is now {n1: {n2: edge1}}

However, this isn't actually ending up with the structure I want. I want
it to finally end up as {n1: {n2: edge1}, n2: {}}. Is
there any way I can do this simply?

Nathan

By now you have probably discovered that the networkx package can handle
this.
If I have this right, you want to create a digraph with
a directed edge from "Node1" to "Node2" and this edge
has the string "Test3" attached to it. In networkx, this is exactly what
the XDiGraph class was designed to do. Here DiGraph means
directed graph and the X means you are allowed to add (any)
data to the edge, for example:

Also, I am looking at having a large graph and was wondering if anyone
knows of any way I could reduce the memory requirements of this
structure and improve the speed of queries on it. I'm thinking of writing
a C extension for it. Is this a good idea, and where would I start?
Or does Python have some kind of transparent memory access module I
can use?

Networkx was designed so that you can hook your own
C extension in. However, making it speed or memory efficient
is quite application dependent. I am still not clear as to exactly what
class of algorithms you want to implement via a string-interval
representation, and whether you demand exact alignment or whether
missing/incorrect data etc. is allowed as part of the alignment
problem.

HTH
Pieter Swart
 
