Storing large strings for future equality checks

Discussion in 'Java' started by Abu Yahya, Jun 8, 2011.

  1. Abu Yahya

    Abu Yahya Guest

    A small application that I'm making requires me to store very long
    strings (>1000 characters) in a database.

    I will need to use these strings later to compare for equality to
    incoming strings from another application. I will also want to add some
    of the incoming strings to the storage, if they meet certain criteria.

    For my application, I get a feeling that storing these strings in my
    table will be a waste of space, and will impact performance due to
    retrieval and storage times, as well as comparison times.

    I considered using an SHA-512 hash of these strings and storing them in
    the database. However, while these will save on storage space, it will
    take time to do the hashing before comparing an incoming string. So I'm
    still wasting time. (Collisions due to hashing will not be a problem,
    since an occasional false positive will not be fatal for my application).

    What would be the best approach?
    Abu Yahya, Jun 8, 2011
    #1
    1. Advertising

  2. Abu Yahya

    markspace Guest

    On 6/8/2011 9:35 AM, Abu Yahya wrote:

    > I considered using an SHA-512 hash of these strings and storing them in
    > the database. However, while these will save on storage space, it will
    > take time to do the hashing before comparing an incoming string. So I'm
    > still wasting time. (Collisions due to hashing will not be a problem,
    > since an occasional false positive will not be fatal for my application).



    You have to store the whole string. Even if the SHA-512 hash codes are
    equal, it could be that the strings are different. You'll have to
    eventually compare the raw string, even if the SHA is used as a
    quick-out case.

    No one can really tell what is "faster" or "wasting time" until you can
    better characterize the usage patterns. How big can these strings get?
    How often will you get an actual duplicate? What's the penalty when
    you need to add a new string? You'll need to implement a few
    algorithms, profile them and then make a decision based on actual data.

    For Java, I'd store the strings in a WeakHashMap or similar to allow
    them to be cached, but tossed away when more storage is needed. Also
    you should look into getting some DB caching library, much easier than
    implementing this yourself (sorry I can't personally recommend any).
    markspace, Jun 8, 2011
    #2
    1. Advertising

  3. Abu Yahya

    David Kerber Guest

    [This followup was posted to comp.lang.java.databases and a copy was
    sent to the cited author.]

    In article <iso8cm$a80$>,
    says...
    >
    > A small application that I'm making requires me to store very long
    > strings (>1000 characters) in a database.


    Unless you're storing millions of these strings or using Access, I'd say
    just store and use the string. I think you'll find that the speed
    penalty is nearly unnoticeable.

    D
    David Kerber, Jun 8, 2011
    #3
  4. Abu Yahya

    Willem Guest

    markspace wrote:
    ) On 6/8/2011 9:35 AM, Abu Yahya wrote:
    )> I considered using an SHA-512 hash of these strings and storing them in
    )> the database. However, while these will save on storage space, it will
    )> take time to do the hashing before comparing an incoming string. So I'm
    )> still wasting time. (Collisions due to hashing will not be a problem,
    )> since an occasional false positive will not be fatal for my application).
    )
    ) You have to store the whole string. Even if the SHA-512 hash codes are
    ) equal, it could be that the strings are different. You'll have to
    ) eventually compare the raw string, even if the SHA is used as a
    ) quick-out case.

    No he doesn't. Read again. Especially the last bit between parentheses.


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
    Willem, Jun 8, 2011
    #4
  5. On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya <>
    wrote:

    >A small application that I'm making requires me to store very long
    >strings (>1000 characters) in a database.
    >
    >I will need to use these strings later to compare for equality to
    >incoming strings from another application. I will also want to add some
    >of the incoming strings to the storage, if they meet certain criteria.
    >
    >For my application, I get a feeling that storing these strings in my
    >table will be a waste of space, and will impact performance due to
    >retrieval and storage times, as well as comparison times.


    Your feeling is irrelevant. You should benchmark. If you do
    not, you may end up jumping through hoops for something that is
    unneeded (though you may never find out it that it is unneeded).

    >I considered using an SHA-512 hash of these strings and storing them in
    >the database. However, while these will save on storage space, it will
    >take time to do the hashing before comparing an incoming string. So I'm
    >still wasting time. (Collisions due to hashing will not be a problem,
    >since an occasional false positive will not be fatal for my application).


    What does "occasional" mean?

    >What would be the best approach?


    If it is a small app, why are you worrying about it so much?
    Create a straightforward design, code it, and see if it is fast
    enough. If it is, keep it. If it is not, then and only then, concern
    yourself with how to speed it up.

    Quit wasting time posting here, and benchmark something. <BEG>

    Sincerely,

    Gene Wirchenko
    Gene Wirchenko, Jun 8, 2011
    #5
  6. Abu Yahya

    Abu Yahya Guest

    On 6/8/2011 10:19 PM, markspace wrote:
    > On 6/8/2011 9:35 AM, Abu Yahya wrote:
    >
    >> I considered using an SHA-512 hash of these strings and storing them in
    >> the database. However, while these will save on storage space, it will
    >> take time to do the hashing before comparing an incoming string. So I'm
    >> still wasting time. (Collisions due to hashing will not be a problem,
    >> since an occasional false positive will not be fatal for my application).

    >
    >
    > You have to store the whole string. Even if the SHA-512 hash codes are
    > equal, it could be that the strings are different. You'll have to
    > eventually compare the raw string, even if the SHA is used as a
    > quick-out case.


    You're right about comparing the whole strings anyway, but, for this
    application I'm creating, I wouldn't mind very very rare incorrect results.

    >
    > No one can really tell what is "faster" or "wasting time" until you can
    > better characterize the usage patterns. How big can these strings get?
    > How often will you get an actual duplicate? What's the penalty when you
    > need to add a new string? You'll need to implement a few algorithms,
    > profile them and then make a decision based on actual data.


    That makes sense. I'd need to analyze my input data and then run some
    empirical tests.

    >
    > For Java, I'd store the strings in a WeakHashMap or similar to allow
    > them to be cached, but tossed away when more storage is needed. Also you
    > should look into getting some DB caching library, much easier than
    > implementing this yourself (sorry I can't personally recommend any).


    The WeakHashMap idea looks useful. As for a DB caching library, would
    you recommend this as a replacement for the WeakHashMap, or as a complement?

    Thanks for the useful pointers.
    Abu Yahya, Jun 8, 2011
    #6
  7. Abu Yahya

    Abu Yahya Guest

    On 6/8/2011 10:58 PM, Willem wrote:
    > markspace wrote:
    > ) On 6/8/2011 9:35 AM, Abu Yahya wrote:
    > )> I considered using an SHA-512 hash of these strings and storing them in
    > )> the database. However, while these will save on storage space, it will
    > )> take time to do the hashing before comparing an incoming string. So I'm
    > )> still wasting time. (Collisions due to hashing will not be a problem,
    > )> since an occasional false positive will not be fatal for my application).
    > )
    > ) You have to store the whole string. Even if the SHA-512 hash codes are
    > ) equal, it could be that the strings are different. You'll have to
    > ) eventually compare the raw string, even if the SHA is used as a
    > ) quick-out case.
    >
    > No he doesn't. Read again. Especially the last bit between parentheses.
    >


    Yeah, it seems he missed that last part or read it incorrectly.
    Abu Yahya, Jun 8, 2011
    #7
  8. Abu Yahya

    Abu Yahya Guest

    On 6/8/2011 10:28 PM, David Kerber wrote:
    > [This followup was posted to comp.lang.java.databases and a copy was
    > sent to the cited author.]
    >
    > In article<iso8cm$a80$>,
    > says...
    >>
    >> A small application that I'm making requires me to store very long
    >> strings (>1000 characters) in a database.

    >
    > Unless you're storing millions of these strings or using Access, I'd say
    > just store and use the string. I think you'll find that the speed
    > penalty is nearly unnoticeable.


    Thanks - simply storing them the way they are does seem to be the best
    way forward.
    Abu Yahya, Jun 8, 2011
    #8
  9. F'up to cljp

    Abu Yahya wrote:

    > I considered using an SHA-512 hash of these strings and storing them in
    > the database. However, while these will save on storage space, it will
    > take time to do the hashing before comparing an incoming string. So I'm
    > still wasting time. (Collisions due to hashing will not be a problem,
    > since an occasional false positive will not be fatal for my application).


    If you write seldom and read often, why not using two columns:

    string_hashcode
    sha1_hashcode

    If the first is equal, you can calculate the sha1-hash for the string
    to be checked and if that is equal as well, you can consider the
    string as equal. That both hashes collide I expect to be very
    very unlikely (which is why I changed the other alg to sha-1, that
    should be considerably more performant than sha512).

    So calculation of the more complex algorithm is only done while
    storing to the database and when checking a string that is already
    in the database. If you have that case very often you still might
    get a better performance with String.hashcode and SHA1 than with
    just SHA512.

    > What would be the best approach?


    There is no single best approach, only an optimal one. Which
    one it is dependend on what defines one way to be better than
    the other (in terms of performance, storage-space, collision-
    rates, etc).


    Regards, Lothar
    --
    Lothar Kimmeringer E-Mail:
    PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

    Always remember: The answer is forty-two, there can only be wrong
    questions!
    Lothar Kimmeringer, Jun 8, 2011
    #9
  10. Abu Yahya

    Abu Yahya Guest

    On 6/8/2011 11:37 PM, Gene Wirchenko wrote:
    > On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya<>
    > wrote:
    >
    >> I considered using an SHA-512 hash of these strings and storing them in
    >> the database. However, while these will save on storage space, it will
    >> take time to do the hashing before comparing an incoming string. So I'm
    >> still wasting time. (Collisions due to hashing will not be a problem,
    >> since an occasional false positive will not be fatal for my application).

    >
    > What does "occasional" mean?
    >


    The application will be doing some probabilistic evaluation using the
    data it is storing, and needs to be fairly close to the actual /most/ of
    the time. Because it would anyway be impossible to be correct 100% of
    the time, using a wrong value for a particular piece of input data will
    not matter, provided it does not happen too often. (In my text above,
    "rare" would have been a better choice of word than "occasional")
    Abu Yahya, Jun 8, 2011
    #10
  11. David Kerber wrote:

    > In article <iso8cm$a80$>,
    > says...
    >>
    >> A small application that I'm making requires me to store very long
    >> strings (>1000 characters) in a database.

    >
    > Unless you're storing millions of these strings or using Access, I'd say
    > just store and use the string. I think you'll find that the speed
    > penalty is nearly unnoticeable.


    Depends on the database. Some of them force you to use CLOBS for
    text-columns with more than 255 characters. CLOBS are a PITA in
    terms of indexing.


    Regards, Lothar
    --
    Lothar Kimmeringer E-Mail:
    PGP-encrypted mails preferred (Key-ID: 0x8BC3CD81)

    Always remember: The answer is forty-two, there can only be wrong
    questions!
    Lothar Kimmeringer, Jun 8, 2011
    #11
  12. On 08.06.2011 18:35, Abu Yahya wrote:
    > A small application that I'm making requires me to store very long
    > strings (>1000 characters) in a database.
    >
    > I will need to use these strings later to compare for equality to
    > incoming strings from another application. I will also want to add some
    > of the incoming strings to the storage, if they meet certain criteria.
    >
    > For my application, I get a feeling that storing these strings in my
    > table will be a waste of space, and will impact performance due to
    > retrieval and storage times, as well as comparison times.
    >
    > I considered using an SHA-512 hash of these strings and storing them in
    > the database. However, while these will save on storage space, it will
    > take time to do the hashing before comparing an incoming string. So I'm
    > still wasting time. (Collisions due to hashing will not be a problem,
    > since an occasional false positive will not be fatal for my application).
    >
    > What would be the best approach?


    Just out of curiosity: what do those strings represent? What do you do
    with them?

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, Jun 8, 2011
    #12
  13. On Wed, 08 Jun 2011 20:28:11 +0200, Lothar Kimmeringer wrote:

    > If you write seldom and read often, why not using two columns:
    >
    > string_hashcode
    > sha1_hashcode
    >

    You haven't given us a maximum size for the string or the name of the
    target database, so its possible that implementation constraints will
    prevent the strings from fitting into a character() column and you'll
    have to use a character varying, text, clob etc. instead.

    If this type isn't allowed as the table's primary key, you can use the
    hash code as the primary key. Since it doesn't matter if some strings
    have clashing hash codes this can't cause a problem.


    --
    martin@ | Martin Gregorie
    gregorie. | Essex, UK
    org |
    Martin Gregorie, Jun 8, 2011
    #13
  14. Abu Yahya

    Tom Anderson Guest

    On Wed, 8 Jun 2011, Abu Yahya wrote:

    > A small application that I'm making requires me to store very long
    > strings (>1000 characters) in a database.


    So >1000, but by any chance <2000? <4000?

    > I will need to use these strings later to compare for equality to
    > incoming strings from another application. I will also want to add some
    > of the incoming strings to the storage, if they meet certain criteria.
    >
    > For my application, I get a feeling that storing these strings in my
    > table will be a waste of space, and will impact performance due to
    > retrieval and storage times, as well as comparison times.


    I assume by 'table' you mean an in-memory hashtable. I wouldn't be so
    quick to discount this - how many strings do you have? If you had 25 000
    2000-character strings, that would be 100 MB; that's not an exorbitant
    amount. Access would be pretty quick.

    > I considered using an SHA-512 hash of these strings and storing them in
    > the database. However, while these will save on storage space, it will
    > take time to do the hashing before comparing an incoming string. So I'm
    > still wasting time. (Collisions due to hashing will not be a problem,
    > since an occasional false positive will not be fatal for my
    > application).


    If you're using a database, don't bother with a hash, just put the whole
    string in, index the column, and then do equality queries on it. A B-tree
    index will deal with this kind of query pretty efficiently.

    If you wanted to pursue in-memory approaches, i'd suggest constructing a
    trie of some sort. Tries are very fast at retrieval, don't need to access
    the whole key to identify a miss, and only need to access the whole key
    once to identify a hit - you always walk through the key from beginning to
    end (perhaps skipping some characters in some kinds of tree), stopping as
    soon as the key doesn't match, and only reaching the end if it does match.
    I haven't found a good overview of the current state of the art in tries,
    but look up Patricia tries, Judy tries, burst tries, fusion tries, and HAT
    tries. You could consider only storing a prefix of the strings - the first
    100 characters, say - in the trie, to save memory, with the leaves having
    pointers to where the complete strings are stored on disk.

    tom

    --
    The sun just came out, I can't believe it
    Tom Anderson, Jun 8, 2011
    #14
  15. Abu Yahya

    Harry Tuttle Guest

    Lothar Kimmeringer, 08.06.2011 20:31:
    > Depends on the database. Some of them force you to use CLOBS for
    > text-columns with more than 255 characters. CLOBS are a PITA in
    > terms of indexing.


    No serious DBMS requires you to use a CLOB for columns with more than 255 characters.

    The least flexible "big" name is Oracle, which has a limit of 4000 bytes. Beyond that a CLOB is needed.

    Most others have a much larger limit on VARCHAR columns (32k, 64k, 1GB, ...)
    Harry Tuttle, Jun 9, 2011
    #15
  16. Gene Wirchenko writes:
    >On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya <>
    >wrote:
    >>For my application, I get a feeling that storing these strings in my
    >>table will be a waste of space, and will impact performance due to
    >>retrieval and storage times, as well as comparison times.

    >
    > Your feeling is irrelevant. You should benchmark. If you do
    > not, you may end up jumping through hoops for something that is
    > unneeded (though you may never find out it that it is unneeded).


    Indeed. No point in using a lot of time to solve a non-problem. But
    after the benchmark, the decision can depend on who the application is
    for. If it scales poorly, that can bite other users with different
    input data.

    OTOH if he'll be the only user and he finds that full strings and SHA
    are both too slow: Another approach would be to use a faster hash, count
    hash collisions, and don't bother with more if the count is acceptable.

    Or try tries, as Tom Anderson suggested.

    --
    Hallvard
    Hallvard B Furuseth, Jun 9, 2011
    #16
  17. Gene Wirchenko wrote:
    > On Wed, 08 Jun 2011 22:05:30 +0530, Abu Yahya <>
    > wrote:
    >
    >> I considered using an SHA-512 hash of these strings and storing them in
    >> the database. However, while these will save on storage space, it will
    >> take time to do the hashing before comparing an incoming string. So I'm
    >> still wasting time. (Collisions due to hashing will not be a problem,
    >> since an occasional false positive will not be fatal for my application).

    >
    > What does "occasional" mean?


    Assuming the application doesn't accidentally run afoul of a
    hitherto-unknown flaw in SHA-512 - a tremendously unlikely event - it
    means "once in every N/2**256 uses", where N is the current number of
    hashes in the database. (2**256 because of the Birthday Paradox; we're
    interested in the probability of two arbitrary colliding pre-images.)

    Or, in other words, "probably not before the heat death of the
    universe". (Or false vacuum decay, zombie apocalypse, Rapture, etc.)

    Worrying about the time to hash the string is silly. It's linear in
    the length of the string, so it's roughly equivalent to the time taken
    to do a few comparisons (where "a few" depends on how long, on
    average, the matching prefix of the new string and the candidates is).

    However, as various others have pointed out, this is premature
    optimization. There's no reason to use any design other than the
    obvious until a problem is demonstrated.

    --
    Michael Wojcik
    Micro Focus
    Rhetoric & Writing, Michigan State University
    Michael Wojcik, Jun 9, 2011
    #17
  18. On Jun 8, 9:35 am, Abu Yahya <> wrote:
    > A small application that I'm making requires me to store very long
    > strings (>1000 characters) in a database.
    >
    > I will need to use these strings later to compare for equality to
    > incoming strings from another application. I will also want to add some
    > of the incoming strings to the storage, if they meet certain criteria.
    >
    > For my application, I get a feeling that storing these strings in my
    > table will be a waste of space, and will impact performance due to
    > retrieval and storage times, as well as comparison times.
    >
    > I considered using an SHA-512 hash of these strings and storing them in
    > the database. However, while these will save on storage space, it will
    > take time to do the hashing before comparing an incoming string. So I'm
    > still wasting time. (Collisions due to hashing will not be a problem,
    > since an occasional false positive will not be fatal for my application).
    >
    > What would be the best approach?


    If it's that relevant that you're asking, measure first to see if it's
    a problem. If you're that concerned that it will be, then code a
    number of reasonable alternatives and measure.

    Presumably you need to do a Map lookup on the incoming strings. I
    thought about some itern scheme, but that won't work if you're
    receiving a lot of incoming new strings. Storing hashs could work. Do
    you need to store the strings in a database? If you can store them
    locally, maybe a trie?
    http://en.wikipedia.org/wiki/Trie
    I somewhat doubt (maybe?) that you're going to get much better lookup
    performance than a trie (but of course I would measure too).
    Joshua Maurice, Jun 9, 2011
    #18
  19. Abu Yahya

    Harry Tuttle Guest

    Lothar Kimmeringer, 08.06.2011 20:31:
    > Depends on the database. Some of them force you to use CLOBS for
    > text-columns with more than 255 characters. CLOBS are a PITA in
    > terms of indexing.


    No serious DBMS requires you to use a CLOB for columns with more than 255 characters.

    The least flexible "big" name is Oracle, which has a limit of 4000 bytes. Beyond that a CLOB is needed.

    Most others have a much larger limit on VARCHAR columns (32k, 64k, 1GB, ...)
    Harry Tuttle, Jun 10, 2011
    #19
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. matevz bradac

    storing variable arguments for future use

    matevz bradac, Apr 15, 2004, in forum: C Programming
    Replies:
    6
    Views:
    343
    matevz bradac
    Apr 17, 2004
  2. Replies:
    3
    Views:
    296
  3. Replies:
    2
    Views:
    1,264
    Rolf Magnus
    May 29, 2006
  4. Karin Lagesen

    matching strings in a large set of strings

    Karin Lagesen, Apr 29, 2010, in forum: Python
    Replies:
    13
    Views:
    456
    Bryan
    May 3, 2010
  5. Helmut Jarausch
    Replies:
    3
    Views:
    324
    Dave Angel
    Apr 30, 2010
Loading...

Share This Page