Hash Code Compression

Roedy Green

Am I correct in saying this will work best if the current size of the
hash table is a prime number?

You preserve the most high-order randomness that way. This is
discussed in the links I gave you earlier.
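
A quick way to see the effect, with hypothetical hash codes chosen to
differ only in their high bits: a power-of-two table size simply discards
those bits, while a prime modulus folds them into the bucket index.

public class ModulusDemo {
    public static void main(String[] args) {
        // Two hash codes that differ only in bits above the low ten.
        int h1 = 0x10000;   // 65536
        int h2 = 0x20000;   // 131072

        // Power-of-two table size: only the low bits survive, so both collide.
        System.out.println(h1 % 1024);  // 0
        System.out.println(h2 % 1024);  // 0

        // Prime table size: the high bits influence the result.
        System.out.println(h1 % 1021);  // 192
        System.out.println(h2 % 1021);  // 384
    }
}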
 
Roedy Green

Charles said:
I don't know much about the MAD method, but LZW is another useful
compression technique. I consider compression akin to folding spaces.
I also think of greatest common divisor algorithms.

I am thinking of it, but I am not showing how to use it. Sorry.

see http://mindprod.com/jgloss/gzip.html

The FileIO Amanuensis will show you how to do a purely in-RAM
implementation.

See http://mindprod.com/jgloss/applet/fileio.html


As Patricia said, do experiments and measure. It is hard to know in
advance which algorithms will perform best.
 
j1mb0jay

Charles said:
Hi Group:

I don't know much about the MAD method, but LZW is another useful
compression technique. I consider compression akin to folding spaces.
I also think of greatest common divisor algorithms.

I am thinking of it, but I am not showing how to use it. Sorry.

If you have time, implement two solutions and continuously improve.

You could also look up compression patents.

:)

Good luck.

I am not trying to compress a file or image, so Lempel-Ziv-Welch
compression won't work for me (or so I think); I just want to compress an
integer so I can use it as an index into a fixed-size array. I seem to have
improved my current method by simply using the % of the two numbers (the
hash code and the size of the array)!
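
For what it's worth, a minimal sketch of that remainder mapping, assuming
the table is a fixed-size array (the table size below is just a placeholder
prime); Math.floorMod keeps the index non-negative even when hashCode()
is negative, which a plain % would not.

public class BucketIndexDemo {
    // Map a word's 32-bit hash code onto a bucket index in a fixed-size table.
    static int bucketIndex(String word, int tableSize) {
        return Math.floorMod(word.hashCode(), tableSize);
    }

    public static void main(String[] args) {
        int tableSize = 1021;   // placeholder prime bucket count
        System.out.println(bucketIndex("example", tableSize));
    }
}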

j1mb0jay
 
j1mb0jay

Lew said:
Since the data to be "compressed" is exactly four bytes long, LZW and
the like are way too heavy. Patricia's advice to use the remainder
operator % is still the best.


There are at least two U.S. patents for compression algorithms that are
not mathematically feasible. Be careful if you are searching patents.

I am not allowed to use any patents for my coursework; it states this in
the student handbook very clearly, and as stated I feel it is over the
top for what I need.

j1mb0jay
 
Patricia Shanahan

j1mb0jay said:
I am not trying to compress a file or image, so Lempel-Ziv-Welch
compression won't work for me (or so I think); I just want to compress an
integer so I can use it as an index into a fixed-size array. I seem to have
improved my current method by simply using the % of the two numbers (the
hash code and the size of the array)!

j1mb0jay

I think the situation would be clearer if you thought of it, and
described it, as the mapping from hash code to hash table bucket number.

In the current situation, with a hash function that returns a 32-bit int
and a relatively small hash table, the mapping will be a compression
algorithm in the sense that the range is smaller than the domain, but
the important properties are different.

The main property you need is a tendency for the mapping from a String
representing a word, through the String's hash code, to the bucket
number, to distribute words as evenly as possible among buckets. That
does not correspond to any normal requirement on compression algorithms.
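
A rough way to check that evenness empirically, as a sketch only, with
placeholder words and a placeholder bucket count: count how many words
land in each bucket for a given table size.

import java.util.*;

public class BucketSpread {
    // Count how many words fall into each of m buckets.
    static int[] bucketCounts(Collection<String> words, int m) {
        int[] counts = new int[m];
        for (String w : words) {
            counts[Math.floorMod(w.hashCode(), m)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("alpha", "beta", "gamma", "delta"); // placeholder data
        int m = 7;                                                             // placeholder bucket count
        int[] counts = bucketCounts(words, m);
        System.out.println("largest bucket: " + Arrays.stream(counts).max().getAsInt());
        System.out.println("average bucket: " + (double) words.size() / m);
    }
}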

Patricia
 
Eric Sosman

Aha! Your "compression" is the computation of a bucket
index from the hash code. Here's a hint about one reasonable
way to do it: Ask yourself why you chose a prime number as the
bucket count. Was it because prime numbers swirl around hash
tables like a mystical numerological fog, or was there a concrete
reason? If the latter, what was it?
j1mb0jay said:
Because I am currently building the dictionary file by trawling news
articles, each word I pull from an article needs to be checked against the
dictionary to see if we already have it (I don't want to store each word
more than once). My current methodology means I only have to look at a
maximum of 4 words (out of 2.5 million) to see if I already have this
word stored in memory. Does this still mean my retrieval method is
O(N squared)?

It depends what you're counting, and how you're counting it.

If you're counting the number of String comparisons involved
in looking up one String value in a table that already holds N
values, the effort is O(N). You choose one of M lists, which holds
(on the average) N/M items, and you search the list. The search
effort is proportional to the list length, hence (since M is a
constant and since Big-Oh swallows proportionality constants) the
search effort is O(N). With a very small multiplier, to be sure,
but still O(N).

But maybe M isn't constant after all; you said you chose it
to be roughly twice the size of the expected N. If you regard M
as a function of N rather than a constant, then the average list
length N/M is N/(2*N) is 0.5, itself a constant, and the search
effort is O(1). Different ground rules, different outcomes.

Or maybe you're looking at the total effort to insert all N
items into an initially empty table. Then under the first model
the effort is proportional to 0/M + 1/M + ... + (N-1)/M, which
is N*(N-1)/(2*M) or O(N*N). Under the second model, the effort
is 0.5 + 0.5 + ... + 0.5 = N/2 which is O(N).

Or maybe you're noting that each word is inserted only once
but looked up many times, in which case things get much more
complex and you need to start worrying about effects like "common"
words entering into the table pretty early in the game while the
table is fairly small, and "rare" words appearing later on when
the table has grown large. Also, searches for "common" words are
almost always successful, while those for "rare" words have more
chance of being unsuccessful. A thorny thicket, to be sure.

Or maybe you're also trying to account for the time spent in
computing the hash codes, in which case the lengths of the Strings
enter into the picture, and not just their quantity.

Big-Oh exists for the purpose of sweeping unimportant details
under the rug, deliberately discarding precision. But you can only
use Big-Oh well if you're precise about what you're approximating,
about what "N" means, about what things you model as constants and
what you consider variable, and so on. You can only throw away
precision if you're precise about throwing it away!
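
One way to pin down those ground rules is to count the comparisons
yourself. A rough sketch, with made-up names and placeholder data: it
inserts every word into a chained table of m buckets and counts the
String.equals() calls, so you can compare a fixed bucket count against
one tied to the number of words.

import java.util.*;

public class InsertCost {
    // Insert all words, counting equals() calls as the "comparisons".
    static long insertAllCountingComparisons(List<String> words, int m) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < m; i++) buckets.add(new LinkedList<>());
        long comparisons = 0;
        for (String w : words) {
            List<String> bucket = buckets.get(Math.floorMod(w.hashCode(), m));
            boolean found = false;
            for (String existing : bucket) {
                comparisons++;
                if (existing.equals(w)) { found = true; break; }
            }
            if (!found) bucket.add(w);
        }
        return comparisons;
    }

    public static void main(String[] args) {
        List<String> words = loadWords();   // placeholder: however you read your articles
        int n = words.size();
        System.out.println("fixed small M:   " + insertAllCountingComparisons(words, 101));
        // 2*n + 1 is not necessarily prime, but close enough for a rough measurement.
        System.out.println("M about twice N: " + insertAllCountingComparisons(words, 2 * n + 1));
    }

    static List<String> loadWords() {       // placeholder data source
        return Arrays.asList("the", "quick", "brown", "fox", "the");
    }
}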
 
Chris

j1mb0jay said:
My lists are currently smallish; I was just hoping someone would know a
better hash code function or a better compression method which would
make each linked list hold one element rather than up to 4 elements, so
that each linked list could be replaced with a simple KeyValuePair.

Perfection means better marks :D

j1mb0jay

Your choice of data structure should depend on what you're trying to
optimize for.

If you're trying to optimize for speed, then a traditional hash table is
a good choice. You're doing it right.

If you're trying to optimize for space, I would try to reduce your
object count, since each object takes up a lot of memory. You could try to
keep the words, and the pointers to them, in big, possibly paged arrays.
(I've done this with a big int[] array and pages of char[] arrays; it
works and it's pretty quick.) Another way to optimize for space is to
implement some kind of trie, like a Patricia trie. Yet another way would
be to do something B-tree-like and implement key prefix compression.
(It's a ton of work, but really efficient.)
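
A bare-bones sketch of that pointer-plus-pages idea; the page size and
class name are made up, and it assumes no single word is longer than a
page. Each stored word gets an int id back, so a hash table of ints can
stand in for a hash table of String objects.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WordPool {
    private static final int PAGE_SIZE = 1 << 20;   // 1M chars per page (arbitrary)

    private final List<char[]> pages = new ArrayList<>();
    private int[] starts = new int[1024];           // global start offset of each word
    private int[] lengths = new int[1024];          // length of each word
    private int wordCount = 0;
    private int used = 0;                           // chars used in the current page

    public int add(String word) {
        if (pages.isEmpty() || used + word.length() > PAGE_SIZE) {
            pages.add(new char[PAGE_SIZE]);         // start a fresh page
            used = 0;
        }
        char[] page = pages.get(pages.size() - 1);
        word.getChars(0, word.length(), page, used);
        if (wordCount == starts.length) {           // grow the index arrays
            starts = Arrays.copyOf(starts, starts.length * 2);
            lengths = Arrays.copyOf(lengths, lengths.length * 2);
        }
        starts[wordCount] = (pages.size() - 1) * PAGE_SIZE + used;
        lengths[wordCount] = word.length();
        used += word.length();
        return wordCount++;
    }

    public String get(int id) {
        int start = starts[id];
        char[] page = pages.get(start / PAGE_SIZE);
        return new String(page, start % PAGE_SIZE, lengths[id]);
    }

    public static void main(String[] args) {
        WordPool pool = new WordPool();
        int id = pool.add("dictionary");
        System.out.println(pool.get(id));           // prints "dictionary"
    }
}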

If you're trying to optimize for marks, then implement a Bloom Filter
for dictionary lookup. :) You'll blow them away...
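
In case it's useful, a minimal Bloom filter sketch along those lines; the
bit count, hash count, and the second hash function are arbitrary choices
here, not anything prescribed.

import java.util.BitSet;

// Minimal Bloom filter for a word dictionary: no false negatives,
// a small rate of false positives, and no stored Strings at all.
public class BloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public BloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive k bit positions from two base hashes (the usual double-hashing trick).
    private int index(String word, int i) {
        int h1 = word.hashCode();
        int h2 = new StringBuilder(word).reverse().toString().hashCode(); // illustrative second hash
        return Math.floorMod(h1 + i * h2, numBits);
    }

    public void add(String word) {
        for (int i = 0; i < numHashes; i++) bits.set(index(word, i));
    }

    public boolean mightContain(String word) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(word, i))) return false;  // definitely absent
        }
        return true;                                      // present, or a false positive
    }

    public static void main(String[] args) {
        BloomFilter filter = new BloomFilter(1 << 24, 5); // ~16M bits, 5 hashes (arbitrary)
        filter.add("dictionary");
        System.out.println(filter.mightContain("dictionary")); // true
        System.out.println(filter.mightContain("missing"));    // almost certainly false
    }
}

Note that mightContain() can return false positives, so for deduplication
you would still confirm against the real word store before skipping a word.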
 
