Bit IO, digests, checksums


Igor Planinc

While developing a graph (read: vertex-edge thingy) application, I have some
decisions to make and I'd appreciate some thoughts on a couple of ... issues.

I'd like to store (into SQL) generated graphs, including their "properties" like
colourings, various cycles etc. Graphs are simple (no multiple edges nor loops)
and undirected, so their structural data can be stored in (halves of)
adjacency matrices (or adjacency lists) with 0/1 values. I would also like those
graphs to be quickly retrievable (utilizing an SQL index). As various SQL
servers pose restrictions on the sizes of "indexable" fields (MySQL, for example,
had a 255-character limitation not long ago), I was thinking of computing
graph-structure digests and building SQL indices on those.

Now, I'd like to calculate digests using a DigestOutputStream
(constructor: DigestOutputStream(OutputStream stream, MessageDigest digest)), so
I need:
1. Something like BitOutputStream (to pass as the first argument to the above
constructor) which implements, say, writeBit(int) or write(boolean); a rough
sketch of what I mean follows below. I'd also use such a stream for
"serializing" graphs' structural data, and I'd need a matching BitInputStream
for "deserializing". I found JFAC's implementation but haven't tested it yet.
Also: are there any other available (noncommercial) implementations out there?
Any thoughts on possible (bit-wise) endian issues?
2. An implementation of MessageDigest that is as safe (collision-wise) as
possible, limited to 255 bytes (or 127, if I decide to store hex notation), and
computable reasonably fast. And of course implemented in Java, preferably a Sun
SPI in their JRE. Does anyone have any experience with those?

I'd appreciate any insight.
 

Roedy Green

I'd appreciate any insight.

Let's say you could index a node with a java.util.BitSet indicating
all the other nodes it is connected to. (I don't think JDBC has any
such feature.)

How are you going to generate this key on lookup? You could add it as
a BLOB, but the data are useless as a key. You key on node number.

I think what you must do is store edges with start and finish node
numbers. Those, SQL can find. Just make sure SQL builds an index on
both the start and finish node numbers.



Another approach is to use a third-party package that has solved these
problems already. See http://mindprod.com/jgloss/graph.html
though I suspect many won't deal with graphs so big they need SQL.
 

Igor Planinc

Roedy said:
Let's say you could index a node with a java.util.BitSet indicating
all the other nodes it is connected to. (I don't think JDBC has any
such feature.)

How are you going to generate this key on lookup?

Like I said: by generating a digest of the graph's structural data. I'm using
boolean[][] as an adjacency matrix representation (booleans are "enough" since
the graphs I'm dealing with are simple and undirected; the structure itself is
not square, it's triangular). So, I was thinking of flushing that boolean[][]
down some BitOutputStream, generating a digest with a DigestOutputStream. I was
thinking of making a hex representation (a String of 0,1,...,f) of that digest
so that I can build an efficient and robust SQL index.
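
Roughly like this, by the way (a sketch only; MD5 is just a placeholder for
whichever MessageDigest I settle on, and BitOutputStream is the hypothetical
bit-packing stream from my first post):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class GraphDigester {

    // Serializes the (triangular) adjacency structure bit by bit through a
    // DigestOutputStream and returns the resulting digest.
    public static byte[] digest(boolean[][] adjacency)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5"); // placeholder
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        BitOutputStream bits = new BitOutputStream(new DigestOutputStream(sink, md));
        for (boolean[] row : adjacency) {
            for (boolean edge : row) {
                bits.write(edge);
            }
        }
        bits.flush();
        return md.digest(); // 'sink' now also holds the packed STRUCTURE bytes
    }

    // Hex representation (a String of 0,1,...,f), if I store the digest as text.
    public static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
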
You could add it as
a BLOB, but the data are useless as a key. You key on node number.

I think what you must do is store edges with start and finish node
numbers. Those, SQL can find. Just make sure SQL builds an index on
both the start and finish node numbers.



Another approach is to use a third-party package that has solved these
problems already. See http://mindprod.com/jgloss/graph.html
though I suspect many won't deal with graphs so big they need SQL.

My graphs won't be really big (like millions of edges) therefore a storage of a
single graph would be a problem (there will be many of them, though, so that
could represent ... an issue ;-) ). They will be bigger than 255 chars, though.
My problem is this: I want to compute various graph "properties" like
colourings, various polynomials etc. in advance and store them for quick
retrieval later. So, given a graph, I'd like to have those properties readily
available in SQL.
 

Igor Planinc

Igor said:
My graphs won't be really big (like millions of edges) therefore a
storage of a single graph would be a problem

I meant: _wouldn't_ be a problem. (I blame the keyboard ;-) )
 

Chris Uppal

Igor said:
Like I said: by generating a digest of graphs structural data. I'm using
boolean[][] as an adjacency matrix representation (booleans are "enough"
since the graphs I'm dealing with are simple and undirected; the
structure itself is not square, it's triangular). So, I was thinking of
flushing that boolean[][] down some BitOutputStream, generating a digest
with DigestOutputStream. I was thinking of making a hex representation (a
String of 0,1,...,f) of that digest so that I can make an efficient and
robust SQL index.
My graphs won't be really big (like millions of edges) therefore a
storage of a single graph would be a problem (there will be many of them,
though, so that could represent ... an issue ;-) ). They will be bigger
than 255 chars, though.

So you are working with graphs with more than about 64 nodes, if I understand
you correctly. And are converting the adjacency triangle into a bitmap of
around 2K bits or more, corresponding to >256 bytes of binary data. I'm
assuming that you really mean that you will use the digest as a binary object
(in Java and in the DB) -- if not, then by doing so you can cut the length of
the "string" down by a factor of 2, since a hex string only carries 4 bits
per byte, which is very wasteful (assuming ASCII; if the string in the db is
Unicode, then it's even worse).
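(For instance -- picking MD5 purely as an illustration -- a 16-byte digest
becomes 32 ASCII characters in hex, or 64 bytes if the column is UTF-16.)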

If you are using a digest algorithm which approximates to randomness (e.g.
crypto-quality) then the choice of digest would depend mostly on how many
graphs you need to store digests for. A digest producing an N-bit value would
be expected to produce 1 collision if used to "index" sqrt(2**N) graphs. The
various crypto-hashes produce different sized digests, so that may help guide
your choice. (My guess is that performance probably wouldn't be an issue since
(a) the time taken to hash a short "string" is probably a lot less than the
time taken to insert/find it in a DB, and (b) you don't need to use a "high
spec" algorithm but can be fairly safe with a (cryptographically)
not-very-strong-but-fast algorithm.)
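
To put rough numbers on that (back-of-envelope only): a 128-bit digest such as
MD5 would, by that rule of thumb, be good for around sqrt(2**128) = 2**64 graphs
before you would expect a first collision, and a 160-bit digest such as SHA-1
for around 2**80 -- almost certainly far more graphs than you will ever store.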

One other thought is that unless your graphs are quite unusually dense (with a
50-50 chance of /any/ pair of nodes being connected) the 2K bits will not be
"random", but will be quite compressible. I'd think about using run-length
encoding of some kind, or maybe Huffman encoding (presumably the bit patterns:
00000000
00000001
00000010
...
10000000
will occur much more often than, say, 11111111). Or possibly even use a
Deflater (from java.util.zip), though I suspect that might be a long way short
of ideal for this application. Depending on the level of compression that you
can achieve (if any) you might be able to get away without using a digest at
all.
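
If you do try the Deflater route, the sketch below is roughly what I mean --
purely illustrative; whether it beats a hand-rolled run-length scheme on your
actual data you would have to measure:

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class GraphCompressor {

    // Compresses the packed adjacency bits with DEFLATE and returns the
    // compressed bytes (which could then be stored, or digested, instead).
    public static byte[] compress(byte[] packedBits) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(packedBits);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(packedBits.length);
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }
}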

BTW, your scheme seems to require that you have a way of unambiguously ordering
the nodes. I don't see how you do that myself (in the abstract), but if your
application does have that property (or a weak form of it), then you might be
able to leverage that to improve the compression.

-- chris
 

Roedy Green

Like I said: by generating a digest of graphs structural data.

That's how you generate the key when you build the record. How do you generate
that key on lookup, to find the node you want?
 

Igor Planinc

Chris said:
Igor Planinc wrote:

Like I said: by generating a digest of graphs structural data. I'm using
boolean[][] as an adjacency matrix representation (booleans are "enough"
since the graphs I'm dealing with are simple and undirected; the
structure itself is not square, it's triangular). So, I was thinking of
flushing that boolean[][] down some BitOutputStream, generating a digest
with DigestOutputStream. I was thinking of making a hex representation (a
String of 0,1,...,f) of that digest so that I can make an efficient and
robust SQL index.

My graphs won't be really big (like millions of edges) therefore a
storage of a single graph would be a problem (there will be many of them,
though, so that could represent ... an issue ;-) ). They will be bigger
than 255 chars, though.


So you are working with graphs with more than about 64 nodes, if I understand
you correctly.

I (that is, my application) am currently at a development stage. That means,
among other things, that I'm using relatively small graphs. But I'd like to
"extend" as far as possible. Someday.

And are converting the adjacency triangle into a bitmap of
around 2K bits or more, corresponding to >256 bytes of binary data. I'm
assuming that you really mean that you will use the digest as a binary object
(in Java and in the DB) --

I'm using digest(s) (that is, I _plan_ to use a digest) only for SQL retrieval.
Large graphs may require longer keys/values than indexable SQL fields allow.
Furthermore, unlimited-length indexable columns would result in "impossibly"
large indices (which I'm trying to avoid).

It seems like I don't know how to describe the (intended) usage right; still,
let me try once more. Here's a scenario:

1. For a given graph, I compute/calculate/generate/whatever various graph
"properties" (colourings, cycles, polynomials, etc.). Since such computations
tend to be very time consuming, I don't want to repeat them over and over,
but rather store the data (into an SQL database) for further usage.
2. I want to be able to retrieve the above mentioned data fast, and that means
that I have to make a proper SQL index on proper fields/columns. "Indexable"
fields/columns can't be arbitrarily large, so I need a "key" with limited (maybe
fixed) length, and that's where digests come in.
3. On another occasion, given some graph, I digest it and use that digest to
retrieve the above mentioned data, saving myself the hassle of repeating the
calculation.
if not, then by doing so you can cut the length of
the "string" down by a factor of 2, since a hex string only carries 4 bits
per byte, which is very wasteful (assuming ASCII; if the string in the db is
Unicode, then it's even worse).

If you are using a digest algorithm which approximates to randomness (e.g.
crypto-quality) then the choice of digest would depend mostly on how many
graphs you need to store digests for. A digest producing an N-bit value would
be expected to produce 1 collision if used to "index" sqrt(2**N) graphs. The
various crypto-hashes produce different sized digests, so that may help guide
your choice. (My guess is that performance probably wouldn't be an issue since
(a) the time taken to hash a short "string" is probably a lot less than the
time taken to insert/find it in a DB, and (b) you don't need to use a "high
spec" algorithm but can be fairly safe with a (cryptographically)
not-very-strong-but-fast algorithm.)

One other thought is that unless your graphs are quite unusually dense (with a
50-50 chance of /any/ pair of nodes being connected) the 2K bits will not be
"random", but will be quite compressible. I'd think about using run-length
encoding of some kind, or maybe Huffman encoding (presumably the bit patterns:
00000000
00000001
00000010
...
10000000
will occur much more often than, say, 11111111). Or possibly even use a
Deflater (from java.util.zip),

That's an option, yes. But I guess not so much for the digest as for
other fields (structural data) and "properties". Some SQL servers, IIRC,
compress binary (blob) content automatically.
 

Igor Planinc

Roedy said:
That's how you generate the key when you build the record. How do you generate
that key on lookup, to find the node you want?

I guess I don't understand you correctly. By lookup, do you mean an SQL query?
By node, do you mean an SQL record?

As I explained in another post, the scenario is this:

1. For a given graph I perform some time-consuming calculations and store the
results of those calculations (along with the graph data) into the database.
2. Then, on another occasion, given some graph, hoping that I don't have to
perform those calculations again, I go look into the database to see whether I
performed them on some previous occasion (in 1.).

So, a key is a graph (more precisely, its structure: the number of vertices and
how those vertices are interconnected), or rather its short equivalent: a digest.
How I feed the structure to a DigestOutputStream is not really all that important,
provided that I do it identically each time.
 

Roedy Green

I guess I don't understand you correctly. By lookup, do you mean an SQL query?
By node, do you mean an SQL record?

The word key has a special meaning in computer science. Perhaps you
are using it to mean something else.

It means a field in an SQL record by which you can rapidly look up
things. You hand SQL the key, and it finds the corresponding record.
It is an indexed field.

The point I have been trying to get across in my last few posts is that
while you can compute the bitset and put it in as a field in an SQL
record for a node, it is USELESS as a key.

There is no way I could compose the key out of thin air to look up the
node. I would have to find the node first to compute the key.

If you want a unique id for the node, just number them sequentially.
That makes a good key - compact and quick. You can also use it in
other records to reference the node.
 

Igor Planinc

Roedy said:
The word key has a special meaning in computer science. Perhaps you
are using it to mean something else.

It means a field in an SQL record by which you can rapidly look up
things. You hand SQL the key, and it finds the corresponding record.
It is an indexed field.

So we do understand each other in this part.
The point I have been trying to get across in my last few posts is that
while you can compute the bitset and put it in as a field in an SQL
record for a node, it is USELESS as a key.

This part I don't understand.
There is no way I could compose the key out of thin air to look up the node.

Not out of thin air. From a given graph!

(writing this ad hoc so bear with me)

CREATE TABLE GRAPH_DATA (
  DIGEST                VARCHAR(255) NOT NULL,
  STRUCTURE             BLOB NOT NULL,
  THIS_COLOURING        BLOB,
  THAT_COLOURING        BLOB,
  CHROMATIC_NUMBER      INTEGER,
  EDGE_CHROMATIC_NUMBER INTEGER,
  CIRCUMFERENCE         INTEGER,
  DIAMETER              INTEGER,
  THICKNESS             INTEGER,
  MINIMUM_VERTEX_COVER  BLOB,
  AUTOMORPHISMS         BLOB,
  HAMILTONIAN_CYCLE     BLOB,
  EULERIAN_CYCLE        BLOB,
  ETC                   BLOB,
  UNIQUE INDEX DIG_INDEX (DIGEST)
)

So, given a graph G = (V,E), compute a structural digest and (some of the)
properties and put all that into the database with something like

INSERT INTO GRAPH_DATA VALUES (?, ?, ..., ?)

Then, at another time, _given_some_graph_, compute a digest and hopefully get
the rest from the database by

SELECT STRUCTURE, BLAH1, BLAH2, BLAH3 FROM GRAPH_DATA WHERE DIGEST=?

and perhaps (to be on the safe side, collision-wise) test the two structures (the
one of the graph in hand and the one retrieved from the database) for equality.
I would have to find the node first to compute the key.
If you want a unique id for the node, just number them sequentially.
That makes a good key - compact and quick. You can also use it in
other records to reference the node.

Adding ID INTEGER NOT NULL and INDEX ID_INDEX (ID), but that's beside the point.
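
In JDBC terms, the lookup above would be roughly this (a sketch only; I'm
picking CHROMATIC_NUMBER as a representative column, and the byte-wise
comparison of STRUCTURE is the collision check mentioned above):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Arrays;

public class GraphLookup {

    // Returns the stored chromatic number for the graph with this digest, or
    // null if the graph has not been "propertized" yet (or if the digest
    // collided with a different graph's structure).
    public static Integer chromaticNumber(Connection con, String digestHex,
                                          byte[] structure) throws SQLException {
        String sql = "SELECT STRUCTURE, CHROMATIC_NUMBER FROM GRAPH_DATA WHERE DIGEST = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, digestHex);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next() && Arrays.equals(rs.getBytes("STRUCTURE"), structure)) {
                    return rs.getInt("CHROMATIC_NUMBER");
                }
                return null;
            }
        }
    }
}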
 

Roedy Green

Not out of thin air. From a given graph!

but that is a nutty thing to do. It is like using a person's entire
genealogical history back 12 generations as their common name. Every
time you want to say "hello" you have to construct this giant family
tree. Just give them a number, like 7.

What is even nuttier is that you need some simpler name to even begin
thinking about constructing the complex name. So why not use that
name?
 

Igor Planinc

Roedy said:
but that is a nutty thing to do. It is like using a person's entire
genealogical history back 12 generations as their common name. Every
time you want to say "hello" you have to construct this giant family
tree. Just give them a number, like 7.

What is even nuttier is that you need some simpler name to even begin
thinking about constructing the complex name. So why not use that
name?

But then I'd have to enumerate all the graphs I'd ever need "propertized". I
also plan for a distributed application, so a graph would have to have a
universally unique id. Now that I think of it, it'd also be wise to have a
digest id, so that I can possibly use different digests or choose another one
later on if the "current" one proves to be inadequate.

Anyway, the whole thing is a pita. When it comes to such decisions I tend to be
indecisive, because I know what it means when you have to change fundamental
things in mid-development, let alone in a "finished" product.

Anyway, thank you for your thoughts.
 
