Bit IO, digests, checksums

Discussion in 'Java' started by Igor Planinc, Nov 15, 2005.

  1. Igor Planinc

    Igor Planinc Guest

    Developing a graph (read: vertex-edge thingy) application I have some decisions
    to make and I'd appreciate some thoughts on a couple of ... issues.

    I'd like to store (into SQL) generated graphs, including their "properties" like
    colourings, various cycles etc. Graphs are simple (no multiple edges nor loops)
    and undirected and therefore their structural data can be stored in (halfs of)
    adjacency matrices (or adjacency lists) with 0/1 values. I would also like those
    graphs to be quickly retrievable (utilizing an SQL index). As various SQL
    servers pose restrictions onto sizes of "indexable" fields (MySQL, for example,
    had 255 char limitation not long ago) I was thinking of computing
    graph-structure digests and make SQL indices on those.

    Now, I'd like to calculate digests using a DigestOutputStream
    (constructor: DigestOutputStream(OutputStream stream, MessageDigest digest)), so
    I need:
    1. Something like BitOutputStream (to pass as s first argument into above
    constructor) which implements, say, writeBit(int) or write(boolean). I'd also
    use such a stream for "serializing" graphs' structural data. I'd also need
    BitInputStream for "deserializing". I found JFAC's implementation but haven't
    tested it, yet. Also: are there any other available (noncommercial)
    implementations out there? Any thoughts on possible (bit-wise) endian issues?
    2. As safe (collision-wise) as possible implementation of MessageDigest limited
    to 255 bytes (or 127, if I decide to store hex notation) computable reasonably
    fast. And of course implemented in Java, preferably Sun spi in their jre. Does
    anyone have any experience with those?

    I'd appreciate any insight.
     
    Igor Planinc, Nov 15, 2005
    #1
    1. Advertising

  2. Igor Planinc

    Igor Planinc Guest

    Igor Planinc, Nov 15, 2005
    #2
    1. Advertising

  3. Igor Planinc

    Roedy Green Guest

    On Tue, 15 Nov 2005 16:32:30 +0100, Igor Planinc <>
    wrote, quoted or indirectly quoted someone who said :

    >I'd appreciate any insight.


    Let's say you could index a node with a Java.util.BitSet indicating
    all the other nodes it is connected to. (I don't think JDBC has any
    such feature.)

    How are you going to generate this key on lookup? You could add it as
    a BLOB, but the data are useless as a key. You key on node number.

    I think what you must do is store edges with start and finish node
    numbers. Those SQL can find. JSut make sure SQL builds an index on
    both the start and finish node numbers.



    Another approach is use a third party package that has solved these
    problems already. See http://mindprod.com/jgloss/graph.html
    though I suspect many won't deal with graphs so big they need SQL.
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 15, 2005
    #3
  4. Andrey Kuznetsov, Nov 15, 2005
    #4
  5. Igor Planinc

    Igor Planinc Guest

    Roedy Green wrote:
    > On Tue, 15 Nov 2005 16:32:30 +0100, Igor Planinc <>
    > wrote, quoted or indirectly quoted someone who said :
    >
    >
    >>I'd appreciate any insight.

    >
    >
    > Let's say you could index a node with a Java.util.BitSet indicating
    > all the other nodes it is connected to. (I don't think JDBC has any
    > such feature.)
    >
    > How are you going to generate this key on lookup?


    Like I said: by generating a digest of graphs structural data. I'm using
    boolean[][] as an adjacency matrix representation (booleans are "enough" since
    the graphs I'm dealing with are simple and undirected; the structure itself is
    not square, it's triangular). So, I was thinking of flushing that boolean[][]
    down some BitOutputStream, generating a digest with DigestOutputStream. I was
    thinking of making a hex representation (a String of 0,1,...,f) of that digest
    so that I can make an efficient and robust SQL index.

    > You could add it as
    > a BLOB, but the data are useless as a key. You key on node number.
    >
    > I think what you must do is store edges with start and finish node
    > numbers. Those SQL can find. JSut make sure SQL builds an index on
    > both the start and finish node numbers.
    >
    >
    >
    > Another approach is use a third party package that has solved these
    > problems already. See http://mindprod.com/jgloss/graph.html
    > though I suspect many won't deal with graphs so big they need SQL.


    My graphs won't be really big (like millions of edges) therefore a storage of a
    single graph would be a problem (there will be many of them, though, so that
    could represent ... an issue ;-) ). They will be bigger than 255 chars, though.
    My problem is this: I want to compute various graph "properties" like
    colourings, various polynomials etc. in advance and store them for quick
    retrieval later. So, given a graph, I'd like to have those properties readily
    available in SQL.
     
    Igor Planinc, Nov 16, 2005
    #5
  6. Igor Planinc

    Igor Planinc Guest

    Igor Planinc wrote:
    > My graphs won't be really big (like millions of edges) therefore a
    > storage of a single graph would be a problem


    I meant: _wouldn't_ be a problem. (I blame the keyboard ;-) )
     
    Igor Planinc, Nov 16, 2005
    #6
  7. Igor Planinc

    Chris Uppal Guest

    Igor Planinc wrote:

    > Like I said: by generating a digest of graphs structural data. I'm using
    > boolean[][] as an adjacency matrix representation (booleans are "enough"
    > since the graphs I'm dealing with are simple and undirected; the
    > structure itself is not square, it's triangular). So, I was thinking of
    > flushing that boolean[][] down some BitOutputStream, generating a digest
    > with DigestOutputStream. I was thinking of making a hex representation (a
    > String of 0,1,...,f) of that digest so that I can make an efficient and
    > robust SQL index.


    > My graphs won't be really big (like millions of edges) therefore a
    > storage of a single graph would be a problem (there will be many of them,
    > though, so that could represent ... an issue ;-) ). They will be bigger
    > than 255 chars, though.


    So you are working with graphs with more than about 64 nodes, if I understand
    you correctly. And are converting the adjacency triangle into a bitmap of
    around 2K bits or more, corresponding to >256 bytes of binary data. I'm
    assuming that you really mean that you will use the digest as a binary object
    (in Java and in the DB) -- if not then by doing so you can cut the length of
    the "string" down by a factor of 16 since a hex string only represents 16 bits
    per byte which is very wasteful (assuming ASCII, if the string in the db is
    Unicode, then it's even worse).

    If you are using a digest algorithm which approximates to randomness (e.g.
    crypto-quality) then the choice of digest would depend mostly on how many
    graphs you need to store digests for. A digest producing an N-bit value would
    be expected to produce 1 collision if used to "index" sqrt(2**N) graphs. The
    various crypto-hashes produce different sized digests, so that may help guide
    your choice. (My guess is that performance probably wouldn't be an issue since
    (a) the time taken to hash a short "string" is probably a lot less than the
    time take to insert/find it in a DB, and (b) you don't need to use a "high
    spec" algorithm but can be fairly safe with a
    (cryptographically)not-very-strong-but-fast algorithm.)

    One other thought is that unless your graphs are quite unusually dense (with a
    50-50 chance of /any/ pair of nodes being connected) the 2K bits will not be
    "random", but will be quite compressible. I'd think about using run-length
    encoding of some kind, or maybe Huffman encoding (presumably the bit patterns:
    00000000
    00000001
    00000010
    ...
    10000000
    will occur much more often than, say, 11111111). Of possibly even use a
    Deflator (from java.utils.zip), though I suspect that might be a long way short
    of ideal for this application). Depending on the level of compression that you
    can achieve (if any) you might be able to get away without using a digest at
    all.

    BTW, your scheme seems to require that you have a way of unambiguously ordering
    the nodes. I don't see how you do that myself (in the abstract), but if your
    application does have that property (or a weak form of it), then you might be
    able to leverage that to improve the compression.

    -- chris
     
    Chris Uppal, Nov 16, 2005
    #7
  8. Igor Planinc

    Roedy Green Guest

    On Wed, 16 Nov 2005 11:38:02 +0100, Igor Planinc <>
    wrote, quoted or indirectly quoted someone who said :

    >>
    >> How are you going to generate this key on lookup?

    >
    >Like I said: by generating a digest of graphs structural data.


    That's how you generate the key to build. How do you generate that key
    on lookup? to find the node you want?
    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 16, 2005
    #8
  9. Igor Planinc

    Igor Planinc Guest

    Chris Uppal wrote:
    > Igor Planinc wrote:
    >
    >
    >>Like I said: by generating a digest of graphs structural data. I'm using
    >>boolean[][] as an adjacency matrix representation (booleans are "enough"
    >>since the graphs I'm dealing with are simple and undirected; the
    >>structure itself is not square, it's triangular). So, I was thinking of
    >>flushing that boolean[][] down some BitOutputStream, generating a digest
    >>with DigestOutputStream. I was thinking of making a hex representation (a
    >>String of 0,1,...,f) of that digest so that I can make an efficient and
    >>robust SQL index.

    >
    >
    >>My graphs won't be really big (like millions of edges) therefore a
    >>storage of a single graph would be a problem (there will be many of them,
    >>though, so that could represent ... an issue ;-) ). They will be bigger
    >>than 255 chars, though.

    >
    >
    > So you are working with graphs with more than about 64 nodes, if I understand


    I'm, that is my application, is currently in a developing stage. That means,
    among other things, that I'm using relatively small graphs. But I'd like to
    "extend" as far as possible. Someday.

    > you correctly. And are converting the adjacency triangle into a bitmap of
    > around 2K bits or more, corresponding to >256 bytes of binary data. I'm
    > assuming that you really mean that you will use the digest as a binary object
    > (in Java and in the DB) --


    I'm using digest(s) (that is I _plan_ to use a digest) only for SQL retrieval.
    Large graphs may require longer keys/values than indexable SQL fields allow.
    Furthermore, unlimited length indexable columns would result in "impossibly"
    large indices (which I'm trying to avoid).

    It seems like I don't know how to describe the (intended) usage right, still,
    let me try once more. Here's a scenario:

    1. For a given graph, I compute/calculate/generate/whatever various graph
    "properties" (colourings, cycles, polynomials, etc.). Since such computations
    tend to be very time consuming, I don't want to do repeat them over and over,
    but rather store the data (into SQL database) for further usage.
    2. I want to be able to retrieve the above mentioned data fast and that means
    that I have to make a proper SQL index on proper fields/columns. "Indexable"
    fields/columns can't be arbitrarily large so I need a "key" with limited (maybe
    fixed) length and that's where digests come in.
    3. On another occasion, given some graph, I digest it and use that digest to
    retrieve the above mentioned data, saving myself the hassle of repeting the
    calculation.

    > if not then by doing so you can cut the length of
    > the "string" down by a factor of 16 since a hex string only represents 16 bits
    > per byte which is very wasteful (assuming ASCII, if the string in the db is
    > Unicode, then it's even worse).
    >
    > If you are using a digest algorithm which approximates to randomness (e.g.
    > crypto-quality) then the choice of digest would depend mostly on how many
    > graphs you need to store digests for. A digest producing an N-bit value would
    > be expected to produce 1 collision if used to "index" sqrt(2**N) graphs. The
    > various crypto-hashes produce different sized digests, so that may help guide
    > your choice. (My guess is that performance probably wouldn't be an issue since
    > (a) the time taken to hash a short "string" is probably a lot less than the
    > time take to insert/find it in a DB, and (b) you don't need to use a "high
    > spec" algorithm but can be fairly safe with a
    > (cryptographically)not-very-strong-but-fast algorithm.)
    >
    > One other thought is that unless your graphs are quite unusually dense (with a
    > 50-50 chance of /any/ pair of nodes being connected) the 2K bits will not be
    > "random", but will be quite compressible. I'd think about using run-length
    > encoding of some kind, or maybe Huffman encoding (presumably the bit patterns:
    > 00000000
    > 00000001
    > 00000010
    > ...
    > 10000000
    > will occur much more often than, say, 11111111). Of possibly even use a
    > Deflator (from java.utils.zip),


    That's an option, yes. But I guess not so much for the digest, but rather for
    other fields (structural data) and "properties". Some SQL servers, IIRC,
    compress binary (blob) content automatically.

    > though I suspect that might be a long way short
    > of ideal for this application). Depending on the level of compression that you
    > can achieve (if any) you might be able to get away without using a digest at
    > all.
    >
    > BTW, your scheme seems to require that you have a way of unambiguously ordering
    > the nodes. I don't see how you do that myself (in the abstract), but if your
    > application does have that property (or a weak form of it), then you might be
    > able to leverage that to improve the compression.
    >
    > -- chris
    >
    >
     
    Igor Planinc, Nov 16, 2005
    #9
  10. Igor Planinc

    Igor Planinc Guest

    Roedy Green wrote:
    > On Wed, 16 Nov 2005 11:38:02 +0100, Igor Planinc <>
    > wrote, quoted or indirectly quoted someone who said :
    >
    >
    >>>How are you going to generate this key on lookup?

    >>
    >>Like I said: by generating a digest of graphs structural data.

    >
    >
    > That's how you generate the key to build. How do you generate that key
    > on lookup? to find the node you want?


    I guess I don't understand you correctly. By loopkup, do you mean an SQL query?
    By node, do you mean an SQL record?

    As I explained in another post, the scenario is this:

    1. For a given graph I perform some time consuming calculations and store the
    results of those calculations (along with the graph data) into database.
    2. Then, at another occasion, given some graph, hoping that I don't have to
    perform those calculations again, I go look into the database if I performed
    them at some previous occasion (in 1.).

    So, a key is a graph (more precisely, its structure: number of vertices and how
    those vertices are interconnected), or rather it's short equivalent: digest. How
    I feed the structure to a DigestOutputStream is not really all that important,
    provided that I do it identically each time.
     
    Igor Planinc, Nov 16, 2005
    #10
  11. Igor Planinc

    Roedy Green Guest

    On Wed, 16 Nov 2005 16:29:24 +0100, Igor Planinc <>
    wrote, quoted or indirectly quoted someone who said :

    >I guess I don't understand you correctly. By loopkup, do you mean an SQL query?
    >By node, do you mean an SQL record?


    The word key has a special meaning in computer science.. Perhaps you
    are using it to mean something else.

    It means a field in an SQL record by which you can rapidly look up
    things. You hand SQL the key, and it finds the corresponding record.
    It is an indexed field.

    The point I have been trying to get across my last few posts is that
    while you can compute the bitset and put it in as a field in a SQL
    record for a node it is USELESS as a key.

    There is no way I could compose the key out of thin air to look up the
    node. I would have to find the node first to compute the key.

    If you want a unique id for the node, just number them sequentially.
    That makes a good key - compact and quick. You can also use it in
    other records to reference the node.


    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 16, 2005
    #11
  12. Igor Planinc

    Igor Planinc Guest

    Roedy Green wrote:
    > On Wed, 16 Nov 2005 16:29:24 +0100, Igor Planinc <>
    > wrote, quoted or indirectly quoted someone who said :
    >
    >
    >>I guess I don't understand you correctly. By loopkup, do you mean an SQL query?
    >>By node, do you mean an SQL record?

    >
    >
    > The word key has a special meaning in computer science.. Perhaps you
    > are using it to mean something else.
    >
    > It means a field in an SQL record by which you can rapidly look up
    > things. You hand SQL the key, and it finds the corresponding record.
    > It is an indexed field.


    So we do understand each other in this part.

    > The point I have been trying to get across my last few posts is that
    > while you can compute the bitset and put it in as a field in a SQL
    > record for a node it is USELESS as a key.


    This part I don't understand.

    > There is no way I could compose the key out of thin air to look up the


    Not out of thin air. From a given graph!

    (writing this ad hoc so bear with me)

    CREATE TABLE GRAPH_DATA (

    DIGEST VARCHAR(255) NOT NULL,

    STRUCTURE BLOB NOT NULL,

    THIS_COLOURING BLOB,
    THAT_COLOURING BLOB,
    CHROMATIC_NUMBER INTEGER,
    EDGE_CHROMATIC_NUMBER INTEGER,
    CIRCUMFERENCE INTEGER,
    DIAMETER INTEGER,
    THICKNESS INTEGER,
    MINIMUM_VERTEX_COVER BLOB,
    AUTOMORPHISMS BLOB,
    HAMILTONIAN_CYCLE BLOB,
    EULERIAN_CYCLE BLOB,
    ETC BLOB,

    UNIQUE INDEX DIG_INDEX (DIGEST)
    )

    So, given a graph G = (V,E), compute a structural digest and (some of the)
    properties and put all that into database with something like

    INSERT INTO GRAPH_DATA VALUES (?, ?, ..., ?)

    Then, at another time, _given_some_graph_, compute a digest and hopefully get
    the rest from the database by

    SELECT STRUCTURE, BLAH1, BLAH2, BLAH3 FROM GRAPH_DATA WHERE DIGEST=?

    and perhaps (to be on the safe side collision-wise) test the two structures (the
    one of a graph in hand and the one retrieved from database) for equality.

    > node. I would have to find the node first to compute the key.
    > If you want a unique id for the node, just number them sequentially.
    > That makes a good key - compact and quick. You can also use it in
    > other records to reference the node.


    Adding ID INTEGER NOT NULL and INDEX ID_INDEX (ID), but that's beside the point.
     
    Igor Planinc, Nov 16, 2005
    #12
  13. Igor Planinc

    Roedy Green Guest

    On Wed, 16 Nov 2005 18:21:11 +0100, Igor Planinc <>
    wrote, quoted or indirectly quoted someone who said :

    >> There is no way I could compose the key out of thin air to look up the

    >
    >Not out of thin air. From a given graph!


    but that is a nutty thing to do. It is like using a person's entire
    genealogical history back 12 generations as their common name. Every
    time you want to say "hello" you have to construct this giant family
    tree. Just give them an number, like 7
    ..
    What is even nuttier is you need some simpler name to even begin
    thinking about constructing the complex name. So why not use that
    name?

    --
    Canadian Mind Products, Roedy Green.
    http://mindprod.com Java custom programming, consulting and coaching.
     
    Roedy Green, Nov 16, 2005
    #13
  14. Igor Planinc

    Igor Planinc Guest

    Roedy Green wrote:
    > On Wed, 16 Nov 2005 18:21:11 +0100, Igor Planinc <>
    > wrote, quoted or indirectly quoted someone who said :
    >
    >
    >>>There is no way I could compose the key out of thin air to look up the

    >>
    >>Not out of thin air. From a given graph!

    >
    >
    > but that is a nutty thing to do. It is like using a person's entire
    > genealogical history back 12 generations as their common name. Every
    > time you want to say "hello" you have to construct this giant family
    > tree. Just give them an number, like 7
    > .
    > What is even nuttier is you need some simpler name to even begin
    > thinking about constructing the complex name. So why not use that
    > name?


    But then I'd have to enumerate all the graphs I'd ever need "propertized". I
    also plan for a distributed application, so a graph would have to have a
    universally unique id. Now that I think of it, it'd also be wise to have a
    digest id, so that I can possibly use different digests or choose another one
    later on if I find out the "current" one proves to be inadequate.

    Anyway, the whole thing is a pita. When it comes to such decisions I tend to be
    indecisive. Because I know what it means when you have to change fundamental
    things in mid-development, let alone in a "finished" product.

    Anyway, thank you for your thoughts.
     
    Igor Planinc, Nov 17, 2005
    #14
  15. Igor Planinc

    Igor Planinc Guest

    Igor Planinc, Nov 17, 2005
    #15
  16. >>>I need:
    >>>1. Something like BitOutputStream (to pass as s first argument into above
    >>>constructor) which implements, say, writeBit(int) or write(boolean).

    >>
    >>
    >> see http://uio.imagero.com
    >> and http://uio.imagero.com/doc/com/imagero/uio/io/BitOutputStream.html

    >
    > Thanks. DLed it, tested it, it meets all my needs. I especially like the
    > BSD license. Great job, Andrey!


    Then don't hesitate to vote for uio (category best java library):
    http://207.178.67.98/java/readerschoice2004/frameindex.cfm

    Cheers

    Andrey

    --
    Andrey Kuznetsov
    http://uio.imagero.com Unified I/O for Java
    http://reader.imagero.com Java image reader
    http://jgui.imagero.com Java GUI components and utilities
     
    Andrey Kuznetsov, Nov 18, 2005
    #16
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Michael Fairbank

    class checksums

    Michael Fairbank, Dec 5, 2003, in forum: Java
    Replies:
    16
    Views:
    3,131
    Michael Fairbank
    Dec 12, 2003
  2. Chris

    How good are checksums?

    Chris, Apr 29, 2004, in forum: Java
    Replies:
    12
    Views:
    895
    Roedy Green
    Apr 30, 2004
  3. Cousin Stanley

    Debian MD5 Digests

    Cousin Stanley, Jul 27, 2004, in forum: Python
    Replies:
    0
    Views:
    534
    Cousin Stanley
    Jul 27, 2004
  4. Niranjan
    Replies:
    2
    Views:
    245
    Niranjan
    Aug 28, 2008
  5. Matching Java md5 digests

    , Feb 13, 2006, in forum: Perl Misc
    Replies:
    7
    Views:
    460
Loading...

Share This Page