compressing or encrypting a String

Wendy S

I'm sorting JPG files into directories based on an IPTC field. I need a way
to create a reasonably short directory name based on a longer (up to 64
chars) String. For example, I might have 100 images, each having one of
these three IPTC Headlines:

Mary Smith | Zorro | Some Event
John Brown | Patrick | Some Event
Alice Jones | Jack | Some Event

I need to generate three short strings that I can use for directory names
that will be the same every time a certain String is processed. I do NOT
need to be able to decode the original value from the shortened one. I only
care that I get the same directory name for the same long string every time
I "compress" it.

After some searching around, I realized that String.hashCode() does
_exactly_ what I'm asking for.

However, vanity requires (or at least wishes) that the directory name be a
little "prettier" and alphanumeric. Is there anything I can do with an
integer like 1263925117 to get something like 4xr52z (and still be
reasonably sure I won't get duplicates, even when the second half of all of
the Strings is likely to be the same)?

Or does anyone know another way to compress the original String into
something shorter?

(If anyone comes upon this later looking for a Java API to read IPTC or EXIF
data from images, look here: http://www.drewnoakes.com/code/exif/ )

Thanks for any ideas,
 
Chris

Wendy S said:
However, vanity requires (or at least wishes) that the directory name be a
little "prettier" and alphanumeric. Is there anything I can do with an
integer like 1263925117 to get something like 4xr52z (and still be
reasonably sure I won't get duplicates, even when the second half of all of
the Strings is likely to be the same)?

String dirName = Integer.toHexString("foo".hashCode());

By encoding the directory name using hex digits instead of decimal digits,
the name will never be longer than 8 characters.
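
(If something closer to the "4xr52z" style from the original question is wanted, a small
untested variation is to mask off the sign bit and render the hash in base 36 instead,
which gives an all-lowercase alphanumeric name of at most 6 characters:)

// Mask off the sign bit so the value is non-negative, then use base 36.
String dirName = Integer.toString("foo".hashCode() & 0x7FFFFFFF, 36);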
 
Wendy S

Chris said:
String dirName = Integer.toHexString("foo".hashCode());
By encoding the directory name using hex digits instead of decimal digits,
the name will never be longer than 8 characters.

Thank you, that's exactly what I needed.
 
Wendy S

David Zimmerman said:
Hash codes of strings are not necessarily unique. Do you mind collisions?

How often do two Strings compare as equal when they're actually not? I
would think it's fairly rare, but in this case it wouldn't be fatal. All
I'm doing is sorting images into subdirectories to make web galleries, so
the worst that happens is I lose a sale if someone can't find their pictures
to place an order.
 
James Kanze

Chris said:
Wendy S wrote:
[I'm assuming you mean, "how often do they have equal hashes
when they're not equal?"]
If you are producing 32-bit hash codes, and if your hashing
algorithm produces more-or-less random values, then you'd
expect to find at least 1 pair of strings with the same hash
if you have at least 2^16 input strings. 65,000 files isn't
really /that/ many, so I wouldn't recommend using a 32-bit
hash for this kind of purpose unless you know that your
application is going to be handling a lot less than 65K files.
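
(For a rough sense of the arithmetic behind that estimate, the usual birthday-bound
approximation can be sketched as follows; the method name is just illustrative:)

// Expected number of colliding pairs among n strings with a b-bit hash:
// roughly n*(n-1) / 2^(b+1); at n = 65536 and b = 32 that is about 0.5,
// and it grows with the square of n from there.
static double expectedCollisions(double n, int bits) {
    return n * (n - 1) / Math.pow(2, bits + 1);
}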

Note that 1) Java's hash codes are actually only 31 bits (no
unsigned 32 bit type), and 2) the last time I looked, the Java
hash code was not particularly good for certain types of
strings -- the multiplier is too small.
Also that is on the assumption that String.hashCode() is of
reasonable quality. It doesn't look spectacularly good to me
(though I haven't attempted to validate it).

It is fairly trivial to create test cases where it doesn't work
well -- say all possible two character strings. Whether this is
a problem with real strings depends on what you are doing.
In any case, if you want your hashes to remain stable across
new JVM releases, then you can't base your file names on
String.hashCode(), since it is possible that the next version
of the JVM will use a different hash method (it has already
changed once, so maybe it'll change again).

The original version was a disaster. The current version is a
linear congruential hash using a Mersenne prime as a multiplier --
in my experience, these generally make pretty good hash
functions. On the other hand, the multiplier is small, which if
nothing else means that you won't get large numbers from very
short strings.
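
(For reference, the current algorithm being described is equivalent to the documented
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] formula, i.e. roughly:)

// Equivalent to the documented String.hashCode(): h = 31*h + next char.
static int stringHash(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        h = 31 * h + s.charAt(i);
    }
    return h;
}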
I think you'd be better off using a hash function that:
a) produces a lot more than 32 bits of output
b) is defined independently of the JVM implementation.

Agreed on both counts.
One excellent source of such hashes is to use a crypto hash
like SHA-1 (160 bit?) or MD5 (128 bit). I've never had to do
that in Java, but I presume one would start with an instance
of MessageDigest.
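
(For anyone wanting to follow that route, a minimal sketch using
java.security.MessageDigest might look like the following; the shortName helper and
the choice of SHA-1 with a 6-byte prefix are just illustrative assumptions:)

import java.security.MessageDigest;

// Hypothetical helper: SHA-1 the headline and keep the first 6 bytes
// as 12 hex characters, giving a short, stable directory name.
static String shortName(String headline) throws Exception {
    MessageDigest md = MessageDigest.getInstance("SHA-1");
    byte[] digest = md.digest(headline.getBytes("UTF-8"));
    StringBuffer sb = new StringBuffer();
    for (int i = 0; i < 6; i++) {
        sb.append(Integer.toHexString((digest[i] >> 4) & 0xF));   // high nibble
        sb.append(Integer.toHexString(digest[i] & 0xF));          // low nibble
    }
    return sb.toString();
}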

CRC codes are another good source -- done correctly, they can be
a lot faster than MD5. (I've never implemented SHA-1.)
However the actual cryptographic quality is unimportant here,
so you might find it easier and quicker to code your own
64-bit (say) hash function. Here's what I would try first if
I needed to do this; it's translated directly from C, so it
may not compile as-is...
/**
 * A String hash implemented as a perturbed 64-bit linear congruential
 * pseudo-random number generator.
 */
long longLinearCongruentialHash(String input)
{
    long hash = 0L;
    for (int i = 0; i < input.length(); i++)
    {
        hash += input.charAt(i);           // mix in the next character
        hash *= 6364136223846793005L;      // 64-bit LCG multiplier
    }
    return hash;
}
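
(If an alphanumeric directory name is wanted from that 64-bit value, one further line
would do it; masking with Long.MAX_VALUE keeps the value non-negative:)

String dirName = Long.toString(
        longLinearCongruentialHash("Mary Smith | Zorro | Some Event") & Long.MAX_VALUE, 36);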

I'd have a look at http://www.isthe.com/chongo/tech/comp/fnv/.
As far as I know, I invented the use of a Mersenne prime for the
multiplier, but I currently use FNV whenever I need a hash
code -- it's been more or less proven good. About the only time
I might use something else would be on a machine where
multiplication is extremely slow. What made me look at Mersenne
primes in the first place is that the multiplication can be
implemented by means of a single shift and a subtraction -- on
an 8086, this made a whale of a difference, but on a modern
processor?
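
(For completeness, a 64-bit FNV-1a along the lines of that page is only a few lines in
Java; the constants below are the standard FNV offset basis and prime, and how the
String is turned into bytes is left to the caller:)

// 64-bit FNV-1a over a byte array.
static long fnv1a64(byte[] data) {
    long hash = 0xcbf29ce484222325L;       // FNV offset basis
    for (int i = 0; i < data.length; i++) {
        hash ^= (data[i] & 0xff);          // fold in one byte
        hash *= 0x100000001b3L;            // FNV prime
    }
    return hash;
}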

Note that getting it both correct and rapid without unsigned
arithmetic isn't necessarily trivial. On the other hand, it's
possible (or even probable?) that the errors introduced by
Java's signedness don't really make it that much worse.
 
Filip Larsen

James Kanze wrote
CRC codes are another good source -- done correctly, they can be
a lot faster than MD5. (I've never implemented SHA-1.)

I haven't measured their performance, but the built-in
java.util.zip.CRC32 and Adler32 have served me well in the past.
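
(A minimal sketch of that, with one of the headlines from the original question and
exception handling elided, might be:)

import java.util.zip.CRC32;

// CRC32 of the headline bytes, rendered as at most 8 hex digits.
CRC32 crc = new CRC32();
crc.update("Mary Smith | Zorro | Some Event".getBytes("UTF-8"));
String dirName = Long.toHexString(crc.getValue());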


Regards,
 
Chris Uppal

James said:
The original version was a disaster. The current version is a
linear congruent hash using a Mersenne prime as a multiplier --

I hadn't twigged that the multiplier (31) was a Mersenne prime. I wonder if the
JVM actually JITs the multiplication into a shift and decrement?
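
(For reference, the identity a JIT could exploit here:)

int h = 1263925117;
int viaMultiply = 31 * h;           // ordinary multiplication
int viaShift    = (h << 5) - h;     // same result for any int h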


Thanks for the link. I shall add it/them to my toolbox of hashing algorithms
(though I'm most likely to use a diffusion hash these days).

-- chris
 
