Database specialized in storing directed graphs?

Discussion in 'Python' started by Carl Banks, Oct 28, 2008.

  1. Carl Banks

    Carl Banks Guest

    I was wondering if anyone had any advice on this.

    This is not to study graph theory; I'm using the graph to represent a
    problem domain. The graphs could be arbitrarily large, and could
    easily have millions of nodes, and most nodes have a substantial
    amount of data associated with them. Obviously I don't want a whole
    such graph in memory at once, so libraries the just deal with in-
    memory graphs are out.

    I know I could implement this with a relational DB, and I'd be fine
    with a library that was built on top of one. But I'm hoping for
    specialzed behavior useful for graphs.

    For example, suppose I have a set of nodes called A. It would be
    useful to know if any of these nodes are connected by paths that
    include no nodes in A. I could, of course, do that by reading from
    the database and following the paths, but I clearly don't want to do
    that. I would want instead to somehow cache the outside connectedness
    relationship between different nodes of A. But then what happens if
    something changes elsewhere. How is the cache for A notified, and can
    it be updated efficiently using graph theorems, rather than
    regenerated?

    It's very tricky; that's why I hope someone else has done it.

    I'm guessing no.


    Carl Banks
     
    Carl Banks, Oct 28, 2008
    #1
    1. Advertising

  2. Carl Banks

    Chris Rebert Guest

    On Mon, Oct 27, 2008 at 5:32 PM, Carl Banks <> wrote:
    > I was wondering if anyone had any advice on this.
    >
    > This is not to study graph theory; I'm using the graph to represent a
    > problem domain. The graphs could be arbitrarily large, and could
    > easily have millions of nodes, and most nodes have a substantial
    > amount of data associated with them. Obviously I don't want a whole
    > such graph in memory at once, so libraries the just deal with in-
    > memory graphs are out.
    >
    > I know I could implement this with a relational DB, and I'd be fine
    > with a library that was built on top of one. But I'm hoping for
    > specialzed behavior useful for graphs.
    >
    > For example, suppose I have a set of nodes called A. It would be
    > useful to know if any of these nodes are connected by paths that
    > include no nodes in A. I could, of course, do that by reading from
    > the database and following the paths, but I clearly don't want to do
    > that. I would want instead to somehow cache the outside connectedness
    > relationship between different nodes of A. But then what happens if
    > something changes elsewhere. How is the cache for A notified, and can
    > it be updated efficiently using graph theorems, rather than
    > regenerated?
    >
    > It's very tricky; that's why I hope someone else has done it.
    >
    > I'm guessing no.


    By sacrificing a goat at the altar of the Almighty Google, I was able
    to locate a project I came upon a long while ago but couldn't remember
    the name of that's vaguely like what you want, in that it's a "graph
    database": Neo4j - http://neo4j.org/ (and yes, it's in Java; sigh)
    Not sure it's exactly what you're looking for, but anyway....

    Cheers,
    Chris
    --
    Follow the path of the Iguana...
    http://rebertia.com

    >
    >
    > Carl Banks
    > --
    > http://mail.python.org/mailman/listinfo/python-list
    >
     
    Chris Rebert, Oct 28, 2008
    #2
    1. Advertising

  3. On 2008-10-28 01:32, Carl Banks wrote:
    > I was wondering if anyone had any advice on this.
    >
    > This is not to study graph theory; I'm using the graph to represent a
    > problem domain. The graphs could be arbitrarily large, and could
    > easily have millions of nodes, and most nodes have a substantial
    > amount of data associated with them. Obviously I don't want a whole
    > such graph in memory at once, so libraries the just deal with in-
    > memory graphs are out.
    >
    > I know I could implement this with a relational DB, and I'd be fine
    > with a library that was built on top of one. But I'm hoping for
    > specialzed behavior useful for graphs.
    >
    > For example, suppose I have a set of nodes called A. It would be
    > useful to know if any of these nodes are connected by paths that
    > include no nodes in A. I could, of course, do that by reading from
    > the database and following the paths, but I clearly don't want to do
    > that. I would want instead to somehow cache the outside connectedness
    > relationship between different nodes of A. But then what happens if
    > something changes elsewhere. How is the cache for A notified, and can
    > it be updated efficiently using graph theorems, rather than
    > regenerated?
    >
    > It's very tricky; that's why I hope someone else has done it.
    >
    > I'm guessing no.


    Aaron Watters is an expert on this and has implemented kjbuckets
    for doing this in memory:

    http://gadfly.sourceforge.net/kjbuckets.html

    Gadfly uses the library to implement relational queries (and works
    on disk):

    http://gadfly.sourceforge.net/

    The package is now maintained by Richard Jones.

    You might be able to reuse some parts of Gadfly for your
    purposes.

    Also have a look at Pygr:

    http://bioinfo.mbi.ucla.edu/pygr

    which is a Python library to build a graph interface on top of
    a relational database.

    --
    Marc-Andre Lemburg
    eGenix.com

    Professional Python Services directly from the Source (#1, Oct 28 2008)
    >>> Python/Zope Consulting and Support ... http://www.egenix.com/
    >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
    >>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

    ________________________________________________________________________

    :::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,MacOSX for free ! ::::


    eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
    Registered at Amtsgericht Duesseldorf: HRB 46611
     
    M.-A. Lemburg, Oct 28, 2008
    #3
  4. On Oct 27, 8:32 pm, Carl Banks <> wrote:

    > I was wondering if anyone had any advice on this.
    >
    > This is not to study graph theory; I'm using the graph to represent a
    > problem domain.  The graphs could be arbitrarily large, and could
    > easily have millions of nodes, and most nodes have a substantial
    > amount of data associated with them.  Obviously I don't want a whole
    > such graph in memory at once, so libraries the just deal with in-
    > memory graphs are out.
    >
    > I know I could implement this with a relational DB, and I'd be fine
    > with a library that was built on top of one.  But I'm hoping for
    > specialzed behavior useful for graphs.


    If you're looking for FOSS, the Boost graph library [1] or its
    parallel extension [2] is probably your best bet; it also comes with
    Python bindings but they are not maintained any more. For commercial
    solutions, Star-P [3] seems an interesting platform, with bindings to
    Matlab and Python. Freebase [4] is apparently built on a special graph
    database but unfortunately only the stored data are available, not the
    DB source code.

    George

    [1] http://www.boost.org/doc/libs/1_36_0/libs/graph/doc/index.html
    [2] http://www.osl.iu.edu/research/pbgl/
    [3] http://www.interactivesupercomputing.com/success/sparsematrices.php
    [4] http://www.freebase.com/help/faq#q7
     
    George Sakkis, Oct 28, 2008
    #4
  5. Carl Banks

    Guest

    Sorry Carl Banks for the answering delay, there are problems in Google
    Groups.

    > This is not to study graph theory; I'm using the graph to represent a
    > problem domain. The graphs could be arbitrarily large, and could
    > easily have millions of nodes, and most nodes have a substantial
    > amount of data associated with them. Obviously I don't want a whole
    > such graph in memory at once, so libraries the just deal with in-
    > memory graphs are out.


    It doesn't sound a problem for modern PCs.
    I think you can keep the whole graph topology in RAM, and the node
    data on disk (for example in a file or in DB). The topology is
    represented by the arcs (you have said nothing regarding arc data, so
    I assume it's absent) and the nodes (in RAM you just need a 32 bit
    unsigned integer to represent the index of the node that is stored on
    disk. If memory becomes tight you can use just 3 bytes (2 ^ 24 = 16
    millions different nodes) for the nodes, but generally the memory
    required for the nodes is very little compared to the memory necessary
    to store the arcs).

    You haven't said how many arcs there are, total or average for node,
    and if such arcs are directed or undirected.

    Anyway, using my Graph class (that stores each arc twice), this takes
    about 1 minute and 1.3 GB of RAM (1 million nodes, 10 arcs for node):

    from graph import Graph
    from random import randrange
    g = Graph()
    N = 1000000
    g.addNodes(xrange(N))
    for i in xrange(N * 10):
    g.addArc(randrange(N), randrange(N))


    You have said "could easily have millions of nodes", and every arc may
    have tens of arcs or more.
    ("arbitrarily large" is an unsolvable problem because there are always
    limits in your program, there's no program that's able to work on an
    "arbitrarily large" dataset), so Python data structures become too
    much costly for the RAM. With a lower level language like D/C/C++ you
    can manage a bigger graph. You can use Boost Graph, but a homemade
    graph structure may suffice too for your purposes.

    For the problem you have explained I think a very simple graph
    representation may suffice: an array of integer pairs (starting_index,
    len) (starting_index can be a pointer too), where len is the number of
    outbound arcs of the node n, plus an array of indexes/pointers that
    list the outbound arcs. If memory gets tight you can split this second
    array in two, and use an array of bytes for the lengths (if you have
    more than 256 outbound arcs you may need a short). Note that if you
    use indexes then a Python array.array (or Numpy) suffices.

    In this situation if nnodes = 10_000_000 and narcs/node = 40 (total
    nodes = 40 * 10_000_000):
    nnodes * narcs * 4 + nnodes * (4 + 4) = 1_680_000_000 bytes, that is
    often available on modern PCs.

    On a 64 bit machine the indexes take the same memory, but pointers
    twice that.

    In "memory saving" mode:
    nnodes * narcs * 3 + nnodes * (3 + 1) = 1_240_000_000 bytes.

    A more handy compromise is:
    nnodes * narcs * 3 + nnodes * (4 + 4) = 1_280_000_000 bytes.

    > But then what happens if something changes elsewhere.


    Regarding the data structure, if you use arrays like I have explained,
    if your updates aren't much frequent then you can allocate small extra
    arrays to store more arcs coming out a node (but to do this you
    probably may enjoy using pointers instead of indexes). When you have a
    lot of updated you can save all to disk, and then regenerate the whole
    data structure.

    Bye,
    bearophile
     
    , Oct 28, 2008
    #5
  6. Carl Banks

    flyingfrog Guest

    Don't really know if this will be useful but i'd try pytables:
    http://www.pytables.org/moin
    it deals very well with every kind of hierarchical data sets, doesn't
    matter the size.
    It will let you load only significant data, and you'll be able to
    query your data.
    It's built on top of HDF5 libraries but exposes a very friendly
    pythonic interface.
    Surely you'll still have to implement all of the graph logic yourself
    but this could be a good starting point.
    Hope it helps

    Marco
     
    flyingfrog, Oct 28, 2008
    #6
  7. Carl Banks

    Aaron Brady Guest

    On Oct 27, 7:32 pm, Carl Banks <> wrote:
    > I was wondering if anyone had any advice on this.
    >
    > This is not to study graph theory; I'm using the graph to represent a
    > problem domain.  The graphs could be arbitrarily large, and could
    > easily have millions of nodes, and most nodes have a substantial
    > amount of data associated with them.  Obviously I don't want a whole
    > such graph in memory at once, so libraries the just deal with in-
    > memory graphs are out.
    >
    > I know I could implement this with a relational DB, and I'd be fine
    > with a library that was built on top of one.  But I'm hoping for
    > specialzed behavior useful for graphs.
    >
    > For example, suppose I have a set of nodes called A.  It would be
    > useful to know if any of these nodes are connected by paths that
    > include no nodes in A.  I could, of course, do that by reading from
    > the database and following the paths, but I clearly don't want to do
    > that.  I would want instead to somehow cache the outside connectedness
    > relationship between different nodes of A.  But then what happens if
    > something changes elsewhere.  How is the cache for A notified, and can
    > it be updated efficiently using graph theorems, rather than
    > regenerated?
    >
    > It's very tricky; that's why I hope someone else has done it.
    >
    > I'm guessing no.
    >
    > Carl Banks


    Are you just looking for a persistent graph? The usual options are
    'shelve', 'sqllite', 'mmap', and 'ctypes.Structure'. Or did you need
    a special structure for the 'outside connectedness' properties?

    Storing that is the usual time-space tradeoff. (Actually, I think
    that term refers to something slightly distinct. This would be more
    properly termed read-time/write-time trade off, roughly.)

    Assuming you pick an RDB, you'd have a table of vertices and a table
    of edges. Did you want 'set A' to be a table and persist between runs
    of the program? If not, just keep a set of 2-tuples of nodes that
    satisfy that property. I think the set S you're after is: all x,y
    such that x is a vertex in A, y is a vertex in A, and there exists a P
    such that P is a path, P starts at x, P ends at y, and vertex v in P
    implies that v is x, v is y, or v is not in A. I don't know if you
    can cache any information about A that gives you any shortcuts in
    calculating S, or what the running time or running space of the
    fastest/smallest algorithm is. At worst, it's O( |V| * |E| ) on each
    change to V or E. Unverified.

    You might be able to store the shortest path in a mapping from 2-
    tuples to paths. Or better yet, map each node in A to the set of all
    nodes reachable from it only entering A once. Then, you might save
    time on either adding a node/vertex to A or G, or removing one from A
    or G, or both. Maybe mapping each node in G to that relative set.
    You didn't say whether the graph was directed and/or cyclic.

    A good place to start would be, if you add a node to G, can S
    decrease? No, because the original path P still exists. If you
    remove one from A, still in G, S can stay the same size, might
    decrease. If you remove a node from G, not in A, same, etc.
     
    Aaron Brady, Oct 28, 2008
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Replies:
    1
    Views:
    330
  2. Altu
    Replies:
    2
    Views:
    1,377
    Dimitre Novatchev
    Nov 14, 2007
  3. bukzor

    Directed Graph Traversal

    bukzor, Apr 1, 2008, in forum: Python
    Replies:
    3
    Views:
    1,724
    bukzor
    Apr 3, 2008
  4. Jonathan Wood
    Replies:
    1
    Views:
    520
    Jonathan Wood
    Jun 2, 2008
  5. David Wong
    Replies:
    0
    Views:
    181
    David Wong
    Nov 5, 2012
Loading...

Share This Page