String based hashCode

J

jlukar

The javadoc says the formula to calcuate the hashCode for String is:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

this can result in negative numbers which is not desirable for me.
From the formula this is not clear why as always positive number
should be generated.
Any idea folks ?


Second question:

I am counting on hashCode to help me with generating a unique Integer
key using two concatenated strings the combination of which is
gauranteed to be unique. .

e.g.
("LASTNAME"+"."+"EMPLOYEE-CATEGORY") will always be unique.

so

("LASTNAME"+"."+"EMPLOYEE-CATEGORY") .hashCode() is what I wanted to
use.

will above hashCode() always be unique given that the string
combination used to generate it is unique ?

help???
k.
 
P

Patricia Shanahan

The javadoc says the formula to calcuate the hashCode for String is:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

this can result in negative numbers which is not desirable for me.
From the formula this is not clear why as always positive number
should be generated.
Any idea folks ?

Java int arithmetic is 2's complement signed, and can wrap around from
positive to negative. For example, Integer.MAX_VALUE+1 is negative.
Second question:

I am counting on hashCode to help me with generating a unique Integer
key using two concatenated strings the combination of which is
gauranteed to be unique. .
....

String hash codes are not unique. If a type has more than 2**32
different values, its hash code cannot possibly be unique.

A hash that is unique for all values in its domain is called a "perfect
hash". There are algorithms that will generate a perfect hash for a
small fixed set of strings. Would that help? If not, you need to resort
to the conventional solution - use hashing to reduce the number of
potential collisions, but then search for the exact one you want.

Presumably, you are doing this to solve a higher level problem. If you
were to describe that problem, someone may suggest a better, more
Java-ish, solution.

Patricia
 
L

Lew

The javadoc says the formula to calcuate the hashCode for String is:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

(N.b., the symbol ^ in this context refers to exponentiation, not XOR.)

Why are negative numbers undesirable?

Incidentally, this hash code is not arbitrary. It is difficult to make a
well-designed hash function and this one happens to suit well for 32-bit
twos-complement arithmetic.

See Knuth for more on this.

<http://www-cs-faculty.stanford.edu/~knuth/>
<http://en.wikipedia.org/wiki/Donald_Knuth>
<http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming>

I believe it's /Volume 2 - Seminumerical Algorithms/ that covers this topic.

Beware: changing hashCode() for a class mandates that you also override equals().

-- Lew
 
C

Chris Smith

Lew said:
Beware: changing hashCode() for a class mandates that you also override equals().

The necessity is the other way around. You may choose any hashCode
implementation you like, so long as it is consistent with equals. Any
hash code implementation that is repeatable is consistent with Object's
implementation of equals, so hashCode can literally do practically
anything.

It's when you override equals that maintaining that consistency (namely,
any two objects that are equal also have the same hash code) requires
some effort.
 
M

Mark Rafn

The javadoc says the formula to calcuate the hashCode for String is:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
this can result in negative numbers which is not desirable for me.

Don't use hashCode for something it's not suitable for. If negative numbers
are undesirable, you're using it wrong, and probably want a different method.

Think overflow.
Second question:
I am counting on hashCode to help me with generating a unique Integer
key using two concatenated strings the combination of which is
gauranteed to be unique. .

Then you're using it wrong again. hashCode() has no guarantee of uniqueness,
and for many data types CANNOT guarantee uniqueness.
("LASTNAME"+"."+"EMPLOYEE-CATEGORY") will always be unique.

So equals() should be based on that. great.
("LASTNAME"+"."+"EMPLOYEE-CATEGORY") .hashCode() is what I wanted to
use.

Not unique. There are only 2^32 possible hashCode() return values, and there
are FAR more possible strings than that, so duplicate hash for different
values are alwas an issue. This is called "hash collision".
will above hashCode() always be unique given that the string
combination used to generate it is unique ?

No. Not even close. There's no possible hashing function that will
guarantee uniqueness. ALWAYS account for collisions.

For some purposes, you can get away with using a bigger hash and deciding that
you'll live with the astronomically low chance of collision. You could use
a 128-bit hash (say, MD5), which makes accidental collisions very very
unlikely, but you can't call it hashCode(), because it's not an int.
 
P

Patricia Shanahan

Mark Rafn wrote:
....
No. Not even close. There's no possible hashing function that will
guarantee uniqueness. ALWAYS account for collisions.

I think you are overstating the situation. Given a limited set of
possible key strings, it is possible to construct a perfect hash that
will never map two strings in the set to the same code.

Of course, the Java String hash is not, and cannot be, a perfect hash
because there are more String values than there are hash codes. On the
other hand, the hashCode function for Integer is perfect.

Patricia
 
M

Mark Rafn

Mark said:

Patricia Shanahan said:
I think you are overstating the situation. Given a limited set of
possible key strings, it is possible to construct a perfect hash that
will never map two strings in the set to the same code.

As long as there is an enumerated list of less than 2^32 of them, and you're
willing to do the work, sure. But the OP had something called "lastname" in
his example, of which there are an unlimited number of possibilities.
 
J

jlukar

The javadoc says the formula to calcuate the hashCode for String is:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
this can result in negative numbers which is not desirable for me.
should be generated.
Any idea folks ?

Java int arithmetic is 2's complement signed, and can wrap around from
positive to negative. For example, Integer.MAX_VALUE+1 is negative.

thank you for that. it makes sense now.
...

String hash codes are not unique. If a type has more than 2**32
different values, its hash code cannot possibly be unique.

A hash that is unique for all values in its domain is called a "perfect
hash". There are algorithms that will generate a perfect hash for a
small fixed set of strings. Would that help? If not, you need to resort

If by "small fixed set" you mean something like only "A","F","d" will
appear then that wont work.

I mean that the lengght of string is fixed an no more than 11 chars
overall. You see I know that LASTNAME can not be more than 5
characters. I also know that EMPLOYEE-CATEGORY can not be bigger than
5 characters. So the combination is LASTNAME+.+EMPLOYEE-CATEGORY
= 11 characters total.

I used LASTNAME for clarity sake. The field is actually something else
and can't be more than 5 chars. Same with EMPLOYEE-CATEGORY.

to the conventional solution - use hashing to reduce the number of
potential collisions, but then search for the exact one you want.

Presumably, you are doing this to solve a higher level problem. If you
were to describe that problem, someone may suggest a better, more
Java-ish, solution.

The problem is that I want to generate uniqueue integer keys for the
combonation of "String1"+.+"String2" so that I can use it as a key in
a HashMap type of cache.

I wanted to do it this way so that I can alway derive the key from
"String1" and "String2" that exist in my domain object and therefore
do a cache lookup based on a key derived from "String1" and "String2".

thanks much for your comments already.
 
J

jlukar

The javadoc says the formula to calcuate the hashCode for String is:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

(N.b., the symbol ^ in this context refers to exponentiation, not XOR.)

Why are negative numbers undesirable?

Hi.
I guess to think about it within the context of how I am using it a
negative number is ok. I am using the generated number as a key into
the HashMap for the object. But the comments from all implies that
this is futile because the ("String1"+"String2").hashCode() will not
guarntee a unique key anyways.


Incidentally, this hash code is not arbitrary. It is difficult to make a
well-designed hash function and this one happens to suit well for 32-bit
twos-complement arithmetic.

See Knuth for more on this.

<http://www-cs-faculty.stanford.edu/~knuth/>
<http://en.wikipedia.org/wiki/Donald_Knuth>
<http://en.wikipedia.org/wiki/The_Art_of_Computer_Programming>

thank you for the links. I was hoping for a JAVA API for this without
the use of a algorithm.
 
P

Patricia Shanahan

The problem is that I want to generate uniqueue integer keys for the
combonation of "String1"+.+"String2" so that I can use it as a key in
a HashMap type of cache.

Why not use a HashMap with the concatenated string as key? HashMap knows
all about potential collisions, and deals with them correctly.

Patricia
 
J

jlukar

(e-mail address removed) wrote:

...


Why not use a HashMap with the concatenated string as key? HashMap knows
all about potential collisions, and deals with them correctly.

Patricia


You mean instead of using the int I simply use the
"String1"."String2" for the key ?


The only reason I was trying to get a unique integer out of it was
because the domain Object I was using (lets say Employee) has a
employee ID as a key. Some employees from one building (building A)
all have employee ID's. Employees from another building (building B)
do not have employee ID's yet the legacy system I am working on uses
employee ID as the key into a HashMap.

I was trying to use the same interface to generate integer unique ID's
for the employess from building B.

thanks
 
J

jlukar

As long as there is an enumerated list of less than 2^32 of them, and you're
willing to do the work, sure. But the OP had something called "lastname" in
his example, of which there are an unlimited number of possibilities.

Hi. Let me clarify. The Lastname is used to oversimplify my example.
In actuality the "lastname" is a business domain field that represents
a string of 5 characters. So it could be "ABCDE" or any
combination of. That combined with "EMPLOYEE-CATEGORY" which is also
5 characters gives a combined number of 10 characters. So something
like this "ABCDE"."XIEPE" is a posiblity. But any permutation of
characters could be in those two fields.

thanks


 
J

Joshua Cranmer

The javadoc says the formula to calcuate the hashCode for String is:
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
this can result in negative numbers which is not desirable for me.
From the formula this is not clear why as always positive number
should be generated.
Any idea folks ?
Java int arithmetic is 2's complement signed, and can wrap around from
positive to negative. For example, Integer.MAX_VALUE+1 is negative.

thank you for that. it makes sense now.
...

String hash codes are not unique. If a type has more than 2**32
different values, its hash code cannot possibly be unique.

A hash that is unique for all values in its domain is called a "perfect
hash". There are algorithms that will generate a perfect hash for a
small fixed set of strings. Would that help? If not, you need to resort

If by "small fixed set" you mean something like only "A","F","d" will
appear then that wont work.

I mean that the lengght of string is fixed an no more than 11 chars
overall. You see I know that LASTNAME can not be more than 5
characters. I also know that EMPLOYEE-CATEGORY can not be bigger than
5 characters. So the combination is LASTNAME+.+EMPLOYEE-CATEGORY
= 11 characters total.
Can't be unique. Each character has 2^16 possible values and, not
including the period, there are 10 values, so there are 2^16^10 = 2^160
~ 10^48 different possible combinations. Of course, I assume that you
are using only ~128 = 2^7 values of the characters (7-bit ASCII), but
even so, a hash would produce 2^7^10 = 2^70 ~ 10^21 combinations.
I used LASTNAME for clarity sake. The field is actually something else
and can't be more than 5 chars. Same with EMPLOYEE-CATEGORY.



The problem is that I want to generate uniqueue integer keys for the
combonation of "String1"+.+"String2" so that I can use it as a key in
a HashMap type of cache.

I wanted to do it this way so that I can alway derive the key from
"String1" and "String2" that exist in my domain object and therefore
do a cache lookup based on a key derived from "String1" and "String2".
Most HashMap types of cache implement bucketed algorithms or such.
 
J

Joshua Cranmer

Hi. Let me clarify. The Lastname is used to oversimplify my example.
In actuality the "lastname" is a business domain field that represents
a string of 5 characters. So it could be "ABCDE" or any
combination of. That combined with "EMPLOYEE-CATEGORY" which is also
5 characters gives a combined number of 10 characters. So something
like this "ABCDE"."XIEPE" is a posiblity. But any permutation of
characters could be in those two fields.

thanks
Just saw this. There are 5^5*26^5 (I assume) = (130)^5, 5 * lg (130)
~ 35.1, so there are too many to make into a unique int. That said, if
you only had 10 characters with 5 possible entries each, a potential
would be to treat it as a 10-digit base 5 number:
((a[0]*5+a[1])*5+a[2])*5+...
 
J

jlukar

Hi. Let me clarify. The Lastname is used to oversimplify my example.
In actuality the "lastname" is a business domain field that represents
a string of 5 characters. So it could be "ABCDE" or any
combination of. That combined with "EMPLOYEE-CATEGORY" which is also
5 characters gives a combined number of 10 characters. So something
like this "ABCDE"."XIEPE" is a posiblity. But any permutation of
characters could be in those two fields.

Just saw this. There are 5^5*26^5 (I assume) = (130)^5, 5 * lg (130)
~ 35.1, so there are too many to make into a unique int. That said, if
you only had 10 characters with 5 possible entries each, a potential
would be to treat it as a 10-digit base 5 number:
((a[0]*5+a[1])*5+a[2])*5+...



Thats more than I'd like to chew on. I really needed a simple unique
integer key derived from two strings concat of which is guaruanteed to
be unique within my business application.

looks like it might not be possible in my particular situation using a
simple Java API. I might have to write my own key generator derived
from the string concat.

anything in apache commons library that might help me there if anyone
knows about.

thanks much for all input.
 
M

Mark Rafn

Thats more than I'd like to chew on. I really needed a simple unique
integer key derived from two strings concat of which is guaruanteed to
be unique within my business application.

One possible way is to use a database. Store the string with a
sequence-defined identifier, and use that identifier as your ID, guaranteed
unique. Or just use the string as the key.

You haven't really explained why you need it to be unique, though. Java's
HashMap implementation (and every other implementation of hash I've seen)
handles collisions just fine, and you can too.
 
D

Daniel Pitts

Just saw this. There are 5^5*26^5 (I assume) = (130)^5, 5 * lg (130)
~ 35.1, so there are too many to make into a unique int. That said, if
you only had 10 characters with 5 possible entries each, a potential
would be to treat it as a 10-digit base 5 number:
((a[0]*5+a[1])*5+a[2])*5+...

Thats more than I'd like to chew on. I really needed a simple unique
integer key derived from two strings concat of which is guaruanteed to
be unique within my business application.

looks like it might not be possible in my particular situation using a
simple Java API. I might have to write my own key generator derived
from the string concat.

anything in apache commons library that might help me there if anyone
knows about.

thanks much for all input.

It is very unlikely that you will find a hash code that is unique. As
a matter of fact, you may have misunderstood the use of hashes.

A hash code helps you decide which "bucket" to look in for the key
that you want. If you have 5 buckets and 30 unique keys, you aren't
going to have 30 unique hash codes, but 5. This lets you reduce your
average look up by a factor of 5. You only have to search (on
average) a bucket with 6 items in it.

So, to reiterate. A hash isn't the same as a key. The hash is a
property of the key, and is defined so that if keyA is equal to keyB,
then keyA's hash is equal to keyB's hash. The converse isn't
necessarily true.

This lets you know that if hashA != hashB, then A != B. if hashA ==
hashB, A might = B.

Having said all that...
If you restrict the range of values an object can take, then you
*might* be able to construct a hash function that results in unique
hashs for unique objects.

If your business field is exactly 5 uppercase letters, then that gives
you a possibility of 26^5 combinations, which fits within a 24 bit
number. You can get your hash value by treating the string of 5
uppercase letters as a base 26 number, A = 0, B = 1, ..., Z = 25.
The string AZWAA would be the number 0 * 26^0 + 25 * 26^1 + 22 * 26^2
+ 0 * 26^3 + 0 * 26^4
Notice that each term is a factor of a power of 26.

Hope this helps.
 
L

Lew

Mark said:
One possible way is to use a database. Store the string with a
sequence-defined identifier, and use that identifier as your ID, guaranteed
unique. Or just use the string as the key.

You haven't really explained why you need it to be unique, though. Java's
HashMap implementation (and every other implementation of hash I've seen)
handles collisions just fine, and you can too.

Really he should just use the concatenated String as the key and forget all
about using an int.

-- Lew
 
E

Eric Sosman

[concerning non-uniqueness of hash codes]
I mean that the lengght of string is fixed an no more than 11 chars
overall. You see I know that LASTNAME can not be more than 5
characters. I also know that EMPLOYEE-CATEGORY can not be bigger than
5 characters. So the combination is LASTNAME+.+EMPLOYEE-CATEGORY
= 11 characters total.

Let's do some arithmetic.

There is exactly one String of length zero.

There are 65536 different Strings of length one.

There are 65536^2 = 4,294,967,296 Strings of length two
(using ^ to indicate exponentiation).

...

There are 65536^11 Strings of length eleven.

All told, there are (65536^12 - 1) / 65535 Strings of
length eleven or less. That's just a little under 1e53 (the
11-character Strings make up the vast majority).

Meanwhile, there are 2^32 = 4,294,967,296 distinct hashCode()
values, because that's how many different `int' values exist.

4,294,967,296 may seem like a lot of hashCode() values, but
next to 95,782,432,828,056,470,036,404,813,049,000,000,000,000,-
000,000,000,000+ it is as nothing. If you try to marry the
Strings and the hashCode()s, you will commit bigamy on a truly
grand scale.

And that is why your hope for a universal and unique hash
code for Strings -- even for short Strings -- is in vain.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads

hashCode 95
hashCode 50
hashcode 5
Hashcode and Equal 4
why 31 in hashcode for string 8
Eclipse generated hashCode() and Compiler Efficiency 4
hashCode and equals (again) 19
equals() and hashcode() not working... 6

Members online

No members online now.

Forum statistics

Threads
473,770
Messages
2,569,584
Members
45,075
Latest member
MakersCBDBloodSupport

Latest Threads

Top