very large graph

chrispoliquin

I need to represent the hyperlinks between a large number of HTML
files as a graph. My non-directed graph will have about 63,000 nodes
and probably close to 500,000 edges.

I have looked into igraph (http://cneurocvs.rmki.kfki.hu/igraph/doc/
python/index.html) and networkX (https://networkx.lanl.gov/wiki) for
generating a file to store the graph, and I have also looked into
Graphviz for visualization. I'm just not sure which modules are
best. I need to be able to do the following:

1) The names of my nodes are not known ahead of time, so I will
extract the title from all the HTML files to name the nodes prior to
parsing the files for hyperlinks (edges).

2) Every file will be parsed for links and nondirectional connections
will be drawn between the two nodes.

3) The files might link to each other so the graph package needs to
be able to check to see if an edge between two nodes already exists,
or at least not double draw connections between the two nodes when
adding edges.

I'm relatively new to graph theory so I would greatly appreciate any
suggestions for file types. I imagine doing this as a Python
dictionary with a list for the edges and a node:list pairing is out of
the question for such a large graph?
 

bearophileHUGS

chrispoliquin wrote:
My non-directed graph will have about 63,000 nodes
and probably close to 500,000 edges.

That's large, but not very large by today's standards; truly large
graphs have millions of nodes or more. You'll have to try, but I
think any Python graph library should fit your purpose. A plain list
of node pairs, though, won't do.

For larger graphs you could also use the Boost Graph Library, though
it may be overkill for your purposes. If your graphs are even larger,
you can implement the graph with DataDraw in C and use the compiled
module from Python; that is probably among the fastest near-prebuilt
data structures available. For still larger graphs you may need the
external-memory STXXL library with C++, and then custom code.

The list of operations you have to perform on the graph seems very
simple, so any graph library may be enough, including mine:
http://sourceforge.net/projects/pynetwork/
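For requirement (3) in particular, networkx's undirected Graph class deduplicates edges for you: adding the same edge a second time, in either direction, is a no-op. A minimal sketch (the page names here are invented for illustration):

```python
import networkx as nx

G = nx.Graph()  # undirected graph

# Adding the same edge twice, even with the endpoints swapped,
# leaves exactly one edge in the graph.
G.add_edge("Page A", "Page B")
G.add_edge("Page B", "Page A")

print(G.number_of_nodes())  # 2
print(G.number_of_edges())  # 1
print(G.has_edge("Page A", "Page B"))  # True
```

So there is no need to check for an existing edge before adding one; the library's data structure already guarantees requirement (3).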

For 63,000 nodes and 500,000 edges Graphviz may be too slow, so
you may need to find a faster visualization tool. Several free ones
exist.

Ask if you need more information.

Bye,
bearophile
 

MRAB

chrispoliquin wrote:
I imagine doing this as a Python dictionary with a list for the
edges and a node:list pairing is out of the question for such a
large graph?

Perhaps a dictionary where the key is a node and the value is a set of
destination nodes?
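That suggestion needs only the standard library: a set silently ignores duplicate insertions, and recording the edge under both endpoints keeps the graph undirected. A sketch with invented page names:

```python
from collections import defaultdict

# node -> set of neighbouring nodes
graph = defaultdict(set)

def add_edge(a, b):
    # Record the edge in both directions so the graph is undirected;
    # sets silently discard duplicate insertions.
    graph[a].add(b)
    graph[b].add(a)

add_edge("index.html", "about.html")
add_edge("about.html", "index.html")  # duplicate, changes nothing

print(sorted(graph["index.html"]))  # ['about.html']
print(len(graph["about.html"]))     # 1
```

At 63,000 nodes and 500,000 edges this fits comfortably in memory on an ordinary machine.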
 

castironpi

AFAIK Graphviz is not good at abstracting the graph, which you may need here.
A page with 63,000 circles on it and 500,000 edges will probably come out of
the printer as a black sheet of paper.
(On 8"x11" paper with circles of 1/5", you fit only about 2,200 circles on one
sheet. Fitting a factor of 28 more circles means shrinking each diameter by a
factor of √28, to roughly 0.04".)

Even when actual paper format would not be a problem, you will need some
abstraction and/or tools, as you will not find anything manually in an ocean of
63,000 elements.

One area you may want to start looking for tools is in state graphs, where the
set of possible states of an entire system is unfolded. These things go up to
over a million states, so you only have a 'small' problem there...

For undirected edges, you could make an Edge class and have a set of Edge's
(where two Edge objects are equal when they connect the same nodes).
I don't expect 500,000 elements in a set to be a problem.
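The Edge class described above can be sketched as follows; storing the two endpoints in a frozenset makes equality and hashing order-insensitive (the page names are made up):

```python
class Edge:
    """An undirected edge: Edge(a, b) == Edge(b, a)."""

    def __init__(self, a, b):
        # A frozenset of the endpoints ignores direction and is hashable.
        self.nodes = frozenset((a, b))

    def __eq__(self, other):
        return isinstance(other, Edge) and self.nodes == other.nodes

    def __hash__(self):
        # Objects that compare equal must hash equal, so hash the frozenset.
        return hash(self.nodes)

edges = set()
edges.add(Edge('pageA', 'pageB'))
edges.add(Edge('pageB', 'pageA'))  # same edge, opposite direction
print(len(edges))  # 1
```

Defining `__hash__` alongside `__eq__` is essential here, since the edges live in a set.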

Sincerely,
Albert

Readability counts: do you need to be able to examine the data file by
hand? If not, investigate the 'shelve' module. You can go two ways.
One is to map each node to a set of neighbouring nodes. It's probably
more intuitive, but it needs repeated code:

shelf['pageA'].add( pageB )
shelf['pageB']= set()
shelf['pageB'].add( pageA )

Which is incorrect as is anyway -- updates to mutable shelved objects
aren't committed automatically (unless the shelf is opened with
writeback=True).

a= shelf['pageA']
a.add( pageB )
shelf['pageA']= a
b= shelf['pageB']
b.add( pageA )
shelf['pageB']= b

is closer. Otherwise, way two is to store each unordered pair of node
keys in a secondary shelf. Shelve keys must be strings, so serialize
the pair into an order-independent key:

edgeKey = repr( sorted([ 'pageA', 'pageB' ]) )
shelfEdges[ edgeKey ]= True
 

boggom

Drawing a large graph like this is not
very insightful by itself, and doing this well
is still an art form.
Many cool visualizations, and all
very domain and question dependent,
can be found at
http://www.visualcomplexity.com/vc/

You can also search on flickr for network
and graph drawing.

Much of it is eyecandy though. What do you
really want to know?

Your graph is not overly huge for python, provided
you do not want to compute nonlocal statistics
such as betweenness metrics. (If you have
a laptop running windows and with
less than 1GB RAM, then you have a really large graph.)

Given that you do not know much about graph theory,
I would suggest that networkx applied to a
subset of your large website --- one of the
representative branches --- will allow
for the fastest way to figure out what you
want to do. Python provides access to many other
libraries for html mangling or even webpage
analysis. Imagine writing C code to ask questions
like: how many pdfs and images? how big are they?
when were these pages last updated? etc.
You can come up with 10 questions faster
than you can write C code, and this is where
Python has its great advantage.

You can learn graph theory while using these libraries,
on one of the smaller branches of your website tree.
Even interactively by using ipython.
This provides at least a feeling of great power while
stumbling in the dark.

Many of your edges are probably repetitions
associated with navigation menus that provide
a standard look-and-feel. See how much of that
you can strip out, and find the cross-links that took
effort to insert and manage. (I suggest
doing the analysis on a digraph rather than a graph,
even if you want to draw it as a graph.)
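The digraph suggestion is straightforward with networkx: build a DiGraph while parsing, look for reciprocated links (which often mark deliberate cross-references rather than menu boilerplate), then collapse it to a plain graph for drawing. A sketch with invented file names:

```python
import networkx as nx

D = nx.DiGraph()                    # keep link direction during analysis
D.add_edge("a.html", "b.html")      # a links to b
D.add_edge("b.html", "a.html")      # b links back: a genuine cross-link
D.add_edge("a.html", "menu.html")   # one-way link, e.g. a navigation menu

# Reciprocated links: edges whose reverse also exists (each pair once).
mutual = [(u, v) for u, v in D.edges() if D.has_edge(v, u) and u < v]
print(mutual)                       # [('a.html', 'b.html')]

G = D.to_undirected()               # collapse to a plain graph for drawing
print(G.number_of_edges())          # 2
```

Nothing is lost by analysing the digraph first, since `to_undirected()` gives you the plain graph whenever you need it.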


For visualization, the new Ubigraph is quite
fast compared to Graphviz.
See the cool networkx + ubigraph
video at http://ubietylab.net/blog/


Ask on the networkx mailing list when you get stuck.
 
