VanL said:

I am working on a project that will require building and querying large

graph objects (initially 8M nodes, 30-40M edges; eventually 40M nodes,

100M edges). NetworkX seems to be the most popular, but I am concerned

that a dict representation for nodes would use too much memory -- my

initial tests suggest that a graph with 3M nodes and 12M edges creates

substantial memory pressure on my machine.

Can anybody who has worked with large graphs before give a recommendation?

My initial tests show otherwise. The below test-script creates 3 million

nodes with 12 million adjacencies, on my 2GB Notebook.

The theoretical limit for this (if we assume pointer-based

adjacency-references which makes sense if we have sparse graphs as your

numbers indicate) is (32 bits assumed):

- 8 bytes per node (4 byte pointer to adjacency list, 4 byte int for

counting the number of adjacencies in that list)

- 4 bytes per adjacency

This is 60.000.000 for your example - roughly 60MB. On my machine, the

process has 320.000.000MB - (roughly) a factor five. Given the much

richer properties a Python-object (and python-lists) have thas is pretty

good I'd say.

So for your eventual size of 40M nodes, 100M edges, we have a

theoretical amount of 560MB, times 5 makes 2.5 GB. Not exactly a low

memory profile, but manageable on modern hardware.

I don't know anything about NetworkX - it still might be the better

solution, given the underlying C-based algorithms. But if all you need

is to represent a graph of that size, it appears to be working.

---- test.py ----

import random

import gc

import time

class Node(object):

__slots__ = ["adjacencies", "value", "id"]

def __init__(self, id):

id = id

value = random.random()

self.adjacencies = []

nodes = []

gc.disable()

nc = 3000000

for i in xrange(nc):

nodes.append(Node(i))

if (i % 1000) == 0:

print i

for i in xrange(12000000):

a = random.randint(0, nc - 1)

b = random.randint(0, nc - 1)

while a == b:

b = random.randint(0, nc)

nodes[a].adjacencies.append(nodes

**)**

if (i % 1000) == 0:

print "e", i

gc.enable()

while True:

time.sleep(1)

traversed = set()

def depth_search(node, depth=0):

traversed.add(node)

if depth == 4:

return

for child in node.adjacencies:

if child not in traversed:

depth_search(child, depth+1)

depth_search(nodes[random.randint(0, nc - 1)])