compressing or encrypting a String

Discussion in 'Java' started by Wendy S, Dec 25, 2004.

  1. Wendy S

    Wendy S Guest

    I'm sorting JPG files into directories based on an IPTC field. I need a way
    to create a reasonably short directory name based on a longer (up to 64
    chars) String. For example, I might have 100 images, each having one of
    these three IPTC Headlines:

    Mary Smith | Zorro | Some Event
    John Brown | Patrick | Some Event
    Alice Jones | Jack | Some Event

    I need to generate three short strings that I can use for directory names
    that will be the same every time a certain String is processed. I do NOT
    need to be able to decode the original value from the shortened one. I only
    care that I get the same directory name for the same long string every time
    I "compress" it.

    After some searching around, I realized that String.hashCode() does
    _exactly_ what I'm asking for.

    However, vanity requires (or at least wishes) that the directory name be a
    little "prettier" and alphanumeric. Is there anything I can do with an
    integer like 1263925117 to get something like 4xr52z (and still be
    reasonably sure I won't get duplicates, even when the second half of all of
    the Strings is likely to be the same)?

    Or does anyone know another way to compress the original String into
    something shorter?

    (If anyone comes upon this later looking for a Java API to read IPTC or EXIF
    data from images, look here: http://www.drewnoakes.com/code/exif/ )

    Thanks for any ideas,
    --
    Wendy S
     
    Wendy S, Dec 25, 2004
    #1
    1. Advertising

  2. Wendy S

    Chris Guest

    > However, vanity requires (or at least wishes) that the directory name be a
    > little "prettier" and alphanumeric. Is there anything I can do with an
    > integer like 1263925117 to get something like 4xr52z (and still be
    > reasonably sure I won't get duplicates, even when the second half of all

    of
    > the Strings is likely to be the same)?


    String dirName = Integer.toHexString("foo".hashCode());

    By encoding the directory name using hex digits instead of decimal digits,
    the name will never be longer than 8 characters.
     
    Chris, Dec 25, 2004
    #2
    1. Advertising

  3. Wendy S

    Wendy S Guest

    "Chris" <> wrote
    > String dirName = Integer.toHexString("foo".hashCode());
    > By encoding the directory name using hex digits instead of decimal digits,
    > the name will never be longer than 8 characters.


    Thank you, that's exactly what I needed.

    --
    Wendy
     
    Wendy S, Dec 26, 2004
    #3
  4. Wendy S wrote:
    > "Chris" <> wrote
    >
    >>String dirName = Integer.toHexString("foo".hashCode());
    >>By encoding the directory name using hex digits instead of decimal digits,
    >>the name will never be longer than 8 characters.

    >
    >
    > Thank you, that's exactly what I needed.
    >

    hashcodes of strings are not neccessarily unique. Do you mind collisions?
     
    David Zimmerman, Dec 30, 2004
    #4
  5. Wendy S

    Wendy S Guest

    "David Zimmerman" <> wrote:

    > hashcodes of strings are not neccessarily unique. Do you mind collisions?


    How often do two Strings compare as equal when they're actually not? I
    would think it's fairly rare, but in this case it wouldn't be fatal. All
    I'm doing is sorting images into subdirectories to make web galleries, so
    the worst that happens is I lose a sale if someone can't find their pictures
    to place an order.

    --
    Wendy Smoak
     
    Wendy S, Dec 30, 2004
    #5
  6. Wendy S

    James Kanze Guest

    Chris Uppal wrote:
    > Wendy S wrote:


    >>How often do two Strings compare as equal when they're
    >>actually not?


    > [I'm assuming you mean, "how often to they have equal hashes
    > when they're not equal?"]


    > If you are producing 32-bit hash codes, and if your hashing
    > algorithms produces more-or-less random values, then you'd
    > expect to find at least 1 pair of strings with the same hash
    > if you have at least 2^16 input strings. 65,000 files isn't
    > really /that/ many, so I wouldn't recommend using a 32-bit
    > hash for this kind of purpose unless you know that your
    > application is going to be handling a lot less than 65K files.


    Note that 1) Java's hash codes are actually only 31 bits (no
    unsigned 32 bit type), and 2) the last time I looked, the Java
    hash code was not particularly good for certain types of
    strings -- the multiplier is too small.

    > Also that is on the assumption that String.hash() is of
    > reasonable quality. It doesn't look spectacularly good to me
    > (though I haven't attempted to validate it).


    It is fairly trivial to create test cases where it doesn't work
    well -- say all possible two character strings. Whether this is
    a problem with real strings depends on what you are doing.

    > In any case, if you want your hashes to be reliable over a
    > release of a new JVM implementation, then you can't base your
    > file names on the String.hash() function since it is possible
    > that the next version of the JVM will use a different hash
    > method (it has already changed once, so maybe it'll change
    > again).


    The original version was a disaster. The current version is a
    linear congruent hash using a Mersenne prime as a multiplier --
    in my experience, these generally make pretty good hash
    functions. On the other hand, the multiplier is small, which if
    nothing else means that you won't get large numbers from very
    short strings.

    > I think you'd be better off using a hash function that:


    > a) produces a lot more than 32 bits of output


    > b) is defined independently of the JVM implementation.


    Agreed on both counts.

    > One excellent source of such hashes is to use a crypto hash
    > like SHA-1 (160 bit?) or MD5 (128 bit). I've never had to do
    > that in Java, but I presume one would start with an instance
    > of MessageDigest.


    CRC codes are another good source -- done correctly, they can be
    a lot faster than MD5. (I've never implemented SHA-1.)

    > However the actual crypto-graphic quality is unimportant here,
    > so you might find it easier and quicker to code your own
    > 64-bit (say) hash function. Here's what I would try first if
    > I needed to do this, it's translated directly from 'C' so
    > it'll probably not compile...


    > /**
    > * a String hash implemented as a perturbed 64-bit linear

    congruential
    > pseudo-random
    > * number generator.
    > */
    > long
    > longLinearCongruentialHash(String input)
    > {
    > long hash = 0L;
    > for (int i = 0; i < input.length(); i++)
    > {
    > hash += input.charAt(i);
    > hash *= 6364136223846793005L;
    > }


    > return hash;
    > }


    I'd have a look at http://www.isthe.com/chongo/tech/comp/fnv/.
    As far as I know, I invented the use of a Mersenne prime for the
    multiplicator, but I currently use FNV whenever I need a hash
    code -- it's been more or less proven good. About the only time
    I might use something else would be on a machine where
    multiplication is extremely slow. What made me look at Mersenne
    primes in the first place is that the multiplication can be
    implemented by means of a single shift and a subtraction -- on
    an 8086, this made a whale of a difference, but on a modern
    processor?

    Note that getting it both correct and rapid without unsigned
    arithmetic isn't necessarily trivial. On the other hand, it's
    possible (or even probable?) that the errors introduced by
    Java's signedness don't really make it that much worse.

    --
    James Kanze home: www.gabi-soft.fr
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 pl. Pierre Sémard, 78210 St.-Cyr-l'École, France +33 (0)1 30 23 00 34
     
    James Kanze, Dec 30, 2004
    #6
  7. Wendy S

    Filip Larsen Guest

    James Kanze wrote

    > CRC codes are another good source -- done correctly, they can be
    > a lot faster than MD5. (I've never implemented SHA-1.)


    I haven't measured their performance, but the build-in
    java.util.zip.CRC32 and Adler32 have served me well in the past.


    Regards,
    --
    Filip Larsen
     
    Filip Larsen, Dec 30, 2004
    #7
  8. Wendy S

    Chris Uppal Guest

    James Kanze wrote:

    > > it is possible
    > > that the next version of the JVM will use a different hash
    > > method (it has already changed once, so maybe it'll change
    > > again).

    >
    > The original version was a disaster. The current version is a
    > linear congruent hash using a Mersenne prime as a multiplier --


    I hadn't twigged that the multiplier (31) was Mersenne prime. I wonder if the
    JVM does actually jit the multiplication into a shift and decrement ?


    > I'd have a look at http://www.isthe.com/chongo/tech/comp/fnv/.


    Thanks for the link. I shall add it/them to my toolbox of hashing algorithms
    (though I'm most likely to use a diffusion hash these days).

    -- chris
     
    Chris Uppal, Jan 3, 2005
    #8
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. news
    Replies:
    2
    Views:
    1,119
  2. sven abels

    Compressing a String?

    sven abels, Dec 2, 2003, in forum: Java
    Replies:
    5
    Views:
    16,315
    Michael Borgwardt
    Dec 3, 2003
  3. Replies:
    1
    Views:
    5,036
  4. Toby A Inkster

    Re: Compressing and encrypting XML

    Toby A Inkster, Jul 8, 2003, in forum: HTML
    Replies:
    0
    Views:
    387
    Toby A Inkster
    Jul 8, 2003
  5. Lynn McGuire
    Replies:
    13
    Views:
    1,607
    Lynn McGuire
    Jul 12, 2010
Loading...

Share This Page