Hash Code Compression

j1mb0jay

I am currently working on a dictionary-populating program. I currently
have a socket connection to my local news server and am trawling through
all of the articles looking for new words. Java's String class has a
method that hashes strings. I was wondering if I should still be using
it even though I have over two million words in the hash table, although
the hash table is currently Big O(4).

I am using the Multiply, Add and Divide (MAD) method for the compression
of the hash code. Does Java have any built-in functions (methods) that
will do this for me, or does anyone know of a more efficient way?
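
For reference, MAD compression of a 32-bit hash code looks something
like the sketch below (the constants a, b and p are illustrative: p is
any prime larger than the table size, and a and b are random values
with a % p != 0):

    // Sketch of MAD compression: ((a*h + b) mod p) mod N.
    // The constants here are illustrative, not tuned values.
    static int madCompress(int hashCode, int tableSize) {
        final long p = 4_294_967_311L;     // a prime just above 2^32
        final long a = 1_640_531_527L;     // random multiplier, a % p != 0
        final long b = 104_729L;           // random increment
        long h = hashCode & 0xFFFFFFFFL;   // treat the int as unsigned
        return (int) (((a * h + b) % p) % tableSize);
    }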

j1mb0jay
 
Eric Sosman

j1mb0jay said:
I am currently working on a dictionary-populating program. I currently
have a socket connection to my local news server and am trawling through
all of the articles looking for new words. Java's String class has a
method that hashes strings. I was wondering if I should still be using
it even though I have over two million words in the hash table, although
the hash table is currently Big O(4).

This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
by definition. What do you really mean?
I am using the Multiply, Add and Divide (MAD) method for the compression
of the hash code. Does Java have any built-in functions (methods) that
will do this for me, or does anyone know of a more efficient way?

The value delivered by hashCode -- for any class, not
just for String -- is a Java int, 32 bits wide. How (and why)
are you "compressing" this value?
 
j1mb0jay

Eric said:
This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
by definition. What do you really mean?


The value delivered by hashCode -- for any class, not
just for String -- is a Java int, 32 bits wide. How (and why)
are you "compressing" this value?


My hash table is made up of an array of n LinkedLists (where n is a
prime number that is roughly double the number of words in the
dictionary).

I first use the String.hashCode() method on a given word. I then
compress this number so that I can use it as an index into the array of
LinkedLists, as this 32-bit number is often far too large. I then
insert the word into the LinkedList array at the compressed-value
index. (The fact that the hash table is an array of LinkedLists is so
that it handles collisions.)
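
In sketch form, the structure looks something like this (class and
method names here are illustrative, and compress() stands in for the
MAD step):

    import java.util.LinkedList;

    // Sketch: an array of LinkedLists (buckets), indexed by the
    // compressed hash code. Collisions simply share a bucket.
    class WordTable {
        private final LinkedList<String>[] buckets;

        @SuppressWarnings("unchecked")
        WordTable(int n) {                  // n: a prime, ~2x the word count
            buckets = new LinkedList[n];
            for (int i = 0; i < n; i++) buckets[i] = new LinkedList<>();
        }

        void insert(String word) {
            LinkedList<String> bucket =
                buckets[compress(word.hashCode(), buckets.length)];
            if (!bucket.contains(word)) bucket.add(word);  // store once
        }

        // Stand-in compression: clear the sign bit, then reduce mod n.
        private static int compress(int hashCode, int n) {
            return (hashCode & 0x7FFFFFFF) % n;
        }
    }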

After inserting all of the words into the dictionary, the largest
LinkedList in the array has only four elements. I thought Big O(4) was
the correct way of describing this.

Would it help if I posted my classes on here, or offered you a place to
download the program?

j1mb0jay
 
Daniel Pitts

j1mb0jay said:
My hash table is made up of an array of n LinkedLists (where n is a
prime number that is roughly double the number of words in the
dictionary).

I first use the String.hashCode() method on a given word. I then
compress this number so that I can use it as an index into the array of
LinkedLists, as this 32-bit number is often far too large. I then
insert the word into the LinkedList array at the compressed-value
index. (The fact that the hash table is an array of LinkedLists is so
that it handles collisions.)

After inserting all of the words into the dictionary, the largest
LinkedList in the array has only four elements. I thought Big O(4) was
the correct way of describing this.

Would it help if I posted my classes on here, or offered you a place to
download the program?

j1mb0jay

Why aren't you using the existing HashMap class?

If you want a compact representation of the words you come across,
consider a prefix tree data structure instead.
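
A minimal trie sketch for lowercase ASCII words, just to show the idea
(illustrative, not tested against your data):

    // Sketch of a prefix tree (trie): shared prefixes are stored once,
    // which can be far more compact than millions of full Strings.
    class Trie {
        private final Trie[] children = new Trie[26];
        private boolean isWord;

        // Adds a lowercase word; returns true if it was not already present.
        boolean add(String word) {
            Trie node = this;
            for (int i = 0; i < word.length(); i++) {
                int c = word.charAt(i) - 'a';
                if (node.children[c] == null) node.children[c] = new Trie();
                node = node.children[c];
            }
            boolean wasNew = !node.isWord;
            node.isWord = true;
            return wasNew;
        }
    }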

Just so you know, Big O measures the dominant term without multipliers.
For instance, if your algorithm takes N*N + N + 4 steps, then it is
O(N*N). If it takes 4*N*N steps, it is still O(N*N).
 
Patricia Shanahan

j1mb0jay said:
My hash table is made up of an array of n LinkedLists (where n is a
prime number that is roughly double the number of words in the
dictionary).

I first use the String.hashCode() method on a given word. I then
compress this number so that I can use it as an index into the array of
LinkedLists, as this 32-bit number is often far too large. I then
insert the word into the LinkedList array at the compressed-value
index. (The fact that the hash table is an array of LinkedLists is so
that it handles collisions.)

After inserting all of the words into the dictionary, the largest
LinkedList in the array has only four elements. I thought Big O(4) was
the correct way of describing this.

Would it help if I posted my classes on here, or offered you a place to
download the program?

This is very similar to the design of java.util.HashSet, except it
already has methods for mapping from hashCode to bucket number that have
been tested with Java String.

Is there some particular reason for rolling your own rather than using
the java.util class?

Patricia
 
j1mb0jay

Daniel said:
Why aren't you using the existing HashMap class?

If you want a compact representation of the words you come across,
consider a prefix tree data structure instead.

Just so you know, Big O measures the dominant term without multipliers.
For instance, if your algorithm takes N*N + N + 4 steps, then it is
O(N*N). If it takes 4*N*N steps, it is still O(N*N).

I have been asked to create my own data structures to help aid
understanding of the course material for my degree module (check the
article header).

Because I am currently building the dictionary file by trawling news
articles, each word I pull from an article needs to be checked against
the dictionary to see if we already have it (I don't want to store each
word more than once). My current methodology means I only have to look
at a maximum of 4 words (out of 2.5 million) to see if I already have
this word stored in memory. Does this still mean my retrieval method is
Big O(N squared)?

j1mb0jay
 
Patricia Shanahan

j1mb0jay said:
I have been asked to create my own data structures to help aid
understanding of the course material for my degree module (check the
article header).

That is indeed a good reason to avoid using the standard classes.

Perhaps you should try a few different hash-code-to-bucket-number
mappings, and compare performance. In some situations I have found that
a really simple, quickly calculated mapping such as reduction modulo a
power of two had about the same collision rate as more complicated,
slower-to-compute functions.
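
For example, with a power-of-two table size the reduction is a single
mask (a generic sketch):

    // Sketch: when tableSize is 2^k, (hash mod tableSize) is one AND
    // instruction and never produces a negative index.
    static int bucketFor(int hashCode, int tableSize) {
        return hashCode & (tableSize - 1);
    }

java.util.HashMap uses this same masking trick internally, with some
extra bit-mixing to compensate for poor low-order bits.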

Because I am currently building the dictionary file by trawling news
articles, each word I pull from an article needs to be checked against
the dictionary to see if we already have it (I don't want to store each
word more than once). My current methodology means I only have to look
at a maximum of 4 words (out of 2.5 million) to see if I already have
this word stored in memory. Does this still mean my retrieval method is
Big O(N squared)?


Note that the java.util hash-based classes do offer ways of controlling
the number of buckets.

Big-O is about limits as the problem size tends to infinity. If you only
have to look at 4 words regardless of the number of words, then your
lookup performance would be O(1) (Equivalent to O(42), O(1e100) etc. but
O(1) is more conventional). If the number of words you have to examine
depends on an upper bound on the number of words you process, you would
need to examine the effect of increasing the number of words to get a
computational complexity.

Patricia
 
j1mb0jay

Patricia said:
This is very similar to the design of java.util.HashSet, except it
already has methods for mapping from hashCode to bucket number that have
been tested with Java String.

Is there some particular reason for rolling your own rather than using
the java.util class?

Patricia

I have to roll my own for coursework for my data structures module.

j1mb0jay
 
Daniel Pitts

j1mb0jay said:
I have been asked to create my own data structures to help aid
understanding of the course material for my degree module (check the
article header).

Because I am currently building the dictionary file by trawling news
articles, each word I pull from an article needs to be checked against
the dictionary to see if we already have it (I don't want to store each
word more than once). My current methodology means I only have to look
at a maximum of 4 words (out of 2.5 million) to see if I already have
this word stored in memory. Does this still mean my retrieval method is
Big O(N squared)?

j1mb0jay

Actually, it means it is O(1) (as hash tables tend to be).

Don't use the hash for the index in the linked list. Don't even
BOTHER with indexes in the linked list. The hash should be an index
into an array of lists (of whatever sort, linked or otherwise).

Then each list should be relatively small, so trivial to
search/insert.
 
Roedy Green

j1mb0jay said:
I have over two million words in the hash table

That is not very much compared with the size of the hashCode. However,
if you needed a bigger hashCode for some reason, there are several
popular digest algorithms that will give you various sized results.
See http://mindprod.com/jgloss/digest.html

The traditional way to prune a hash code to size without losing
"randomness" is to take the absolute value of the modulus.

see http://mindprod.com/jgloss/hashcode.html
http://mindprod.com/jgloss/hashtable.html
http://mindprod.com/jgloss/hashmap.html
for a discussion.
 
j1mb0jay

Daniel said:
Actually, it means it is O(1) (as hash tables tend to be).

Don't use the hash for the index in the linked list. Don't even
BOTHER with indexes in the linked list. The hash should be an index
into an array of lists (of whatever sort, linked or otherwise).

Then each list should be relatively small, so trivial to
search/insert.

My lists are currently smallish; I was just hoping someone would know a
better hashCode function or a better compression method which would
make each linked list have 1 element rather than up to 4 elements. That
would mean each linked list could be replaced with a simple
KeyValuePair.

Perfection means better marks :D

j1mb0jay
 
j1mb0jay

Roedy said:
That is not very much compared with the size of the hashCode. However,
if you needed a bigger hashCode for some reason, there are several
popular digest algorithms that will give you various sized results.
See http://mindprod.com/jgloss/digest.html

The traditional way to prune a hash code to size without losing
"randomness" is to take the absolute value of the modulus.

see http://mindprod.com/jgloss/hashcode.html
http://mindprod.com/jgloss/hashtable.html
http://mindprod.com/jgloss/hashmap.html
for a discussion.

Thank you, I am looking to prune the hash code rather than make it
larger, and will take a look at mindprod.

Thanks again

j1mb0jay
 
Patricia Shanahan

j1mb0jay said:
My lists are currently smallish; I was just hoping someone would know a
better hashCode function or a better compression method which would
make each linked list have 1 element rather than up to 4 elements. That
would mean each linked list could be replaced with a simple
KeyValuePair.

Perfection means better marks :D

In that case, do a search for "perfect hash". :)

If you have a fixed list of words, it may be possible to construct a
hash function such that every member of the list has a different code. I
don't know whether it is practical for 2.5 million words - perfect
hashing is usually used for small sets of keywords.

Patricia
 
j1mb0jay

Patricia said:
In that case, do a search for "perfect hash". :)

If you have a fixed list of words, it may be possible to construct a
hash function such that every member of the list has a different code. I
don't know whether it is practical for 2.5 million words - perfect
hashing is usually used for small sets of keywords.

Patricia

I have two main ways of using the dictionary. One is for inserting
words, for which I use the hash table; luckily I have access to servers
with lots of RAM at university, which is why the hash table is double
the size of the word list. The second is when I am trying to break
different encryption types, for which I load the words into a binary
sorted tree where the comparison key is the popularity of each word
during the last trawl of the newsgroups.
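
In sketch form, that second structure might look like this (class and
field names are illustrative, not the actual code):

    // Sketch: a binary search tree ordered by word popularity.
    class PopularityNode {
        final String word;
        final int popularity;        // count from the last trawl
        PopularityNode left, right;

        PopularityNode(String word, int popularity) {
            this.word = word;
            this.popularity = popularity;
        }

        void insert(PopularityNode n) {
            if (n.popularity < popularity) {
                if (left == null) left = n; else left.insert(n);
            } else {
                if (right == null) right = n; else right.insert(n);
            }
        }
    }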
 
Lew

j1mb0jay said:
Thank you, I am looking to prune the hash code rather than make it
larger, and will take a look at mindprod.

Patricia's suggestion of taking hashCode = value % k, where k is the current
size of your hash table, is probably the best.
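
One caveat: in Java, value % k is negative when value is negative, and
Math.abs(Integer.MIN_VALUE) is itself still negative, so masking off
the sign bit first is the safe way to get a bucket index (a sketch):

    // Sketch: prune a 32-bit hash code to a table index.
    // Math.abs(h) % k fails when h == Integer.MIN_VALUE, because
    // Math.abs(Integer.MIN_VALUE) == Integer.MIN_VALUE.
    static int prune(int hashCode, int k) {
        return (hashCode & 0x7FFFFFFF) % k;
    }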
 
j1mb0jay

Lew said:
Patricia's suggestion of taking hashCode = value % k, where k is the
current size of your hash table, is probably the best.

Am I correct in saying this will work best if the current size of the
hash table is a prime number?

j1mb0jay
 
Patricia Shanahan

j1mb0jay said:
Am I correct in saying this will work best if the current size of the
hash table is a prime number?

That seems like a good subject for experiments. It is far from obvious,
especially given the design of the String hashCode method. I would,
however, avoid 31.
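
A quick way to run that experiment is to tabulate the longest chain
each candidate table size produces for the same word list (a sketch;
the sizes below are placeholders):

    // Sketch: for a candidate table size, count how many words land in
    // each bucket and report the longest chain.
    static int longestChain(Iterable<String> words, int tableSize) {
        int[] counts = new int[tableSize];
        int worst = 0;
        for (String w : words) {
            int i = (w.hashCode() & 0x7FFFFFFF) % tableSize;
            worst = Math.max(worst, ++counts[i]);
        }
        return worst;
    }

    // Illustrative comparison on the same word list:
    //   longestChain(words, 1_000_003);   // a prime
    //   longestChain(words, 1_048_576);   // 2^20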

Patricia
 
Charles

j1mb0jay said:
I am currently working on a dictionary-populating program. I currently
have a socket connection to my local news server and am trawling through
all of the articles looking for new words. Java's String class has a
method that hashes strings. I was wondering if I should still be using
it even though I have over two million words in the hash table, although
the hash table is currently Big O(4).

I am using the Multiply, Add and Divide (MAD) method for the compression
of the hash code. Does Java have any built-in functions (methods) that
will do this for me, or does anyone know of a more efficient way?

j1mb0jay

Hi Group:

I don't know much about the MAD method, but LZW is another useful
compression technique. I consider compression akin to folding spaces.
I also think of greatest common divisor algorithms.

I am thinking of it, but I am not showing how to use it. Sorry.

If you have time, implement two solutions and continuously improve.

You could also look up compression patents.

:)

Good luck.
 
Lew

Charles said:
I don't know much about the MAD method, but LZW is another useful
compression technique. I consider compression akin to folding spaces.
I also think of greatest common divisor algorithms.

I am thinking of it, but I am not showing how to use it. Sorry.

If you have time, implement two solutions and continuously improve.

Since the data to be "compressed" is exactly four bytes long, LZW and
the like are way too heavy. Patricia's advice to use the remainder
operator % is still the best.

You could also look up compression patents.

There are at least two U.S. patents for compression algorithms that are not
mathematically feasible. Be careful if you are searching patents.
 
Roedy Green

j1mb0jay said:
My lists are currently smallish; I was just hoping someone would know a
better hashCode function or a better compression method which would
make each linked list have 1 element rather than up to 4 elements. That
would mean each linked list could be replaced with a simple
KeyValuePair.

There is a program that creates an algorithm to get perfect dispersion
IF you know the values in advance, called GPerf. There is a link to it
at http://mindprod.com/jgloss/hashcode.html
 
