Aha! Your "compression" is the computation of a bucket
index from the hash code. Here's a hint about one reasonable
way to do it: Ask yourself why you chose a prime number as the
bucket count. Was it because prime numbers swirl around hash
tables like a mystical numerological fog, or was there a concrete
reason? If the latter, what was it?
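For concreteness, here's one common shape for that compression step in Java. The class and method names are my own illustration, not code from the original poster:

```java
public class Compression {
    // Reduce a 32-bit hash code to a bucket index in [0, bucketCount).
    // Math.floorMod keeps the result non-negative even when hashCode()
    // returns a negative value; a plain % operator could yield a
    // negative index and an ArrayIndexOutOfBoundsException.
    static int bucketIndex(String key, int bucketCount) {
        return Math.floorMod(key.hashCode(), bucketCount);
    }
}
```

With a prime bucketCount, the modulus spreads keys across buckets even when the hash codes share common factors; that, rather than numerology, is the usual concrete reason to prefer a prime.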
Because I am currently building the dictionary file by trawling news
articles, each word I pull from an article needs to be checked against
the dictionary to see if we already have it (I don't want to store each
word more than once). My current methodology means I only have to look
at a maximum of 4 words (out of 2.5 million) to see if I already have
this word stored in memory. Does this still mean my retrieval method is
O(N squared)?
It depends what you're counting, and how you're counting it.
If you're counting the number of String comparisons involved
in looking up one String value in a table that already holds N
values, the effort is O(N). You choose one of M lists, which holds
(on the average) N/M items, and you search the list. The search
effort is proportional to the list length, hence (since M is a
constant and since Big-Oh swallows proportionality constants) the
search effort is O(N). With a very small multiplier, to be sure,
but still O(N).
But maybe M isn't constant after all; you said you chose it
to be roughly twice the size of the expected N. If you regard M
as a function of N rather than a constant, then the average list
length N/M is N/(2*N) is 0.5, itself a constant, and the search
effort is O(1). Different ground rules, different outcomes.
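A toy calculation makes the two ground rules concrete. The helper names and the sample numbers in the tests are my own illustration:

```java
public class ChainLength {
    // Model 1: M is a fixed constant, so the average chain length
    // N/M grows linearly with N -- the search is O(N).
    static double avgChainFixedM(long n, long m) {
        return (double) n / m;
    }

    // Model 2: M is chosen as roughly 2*N, so the average chain
    // length N/(2*N) is always 0.5 -- the search is O(1).
    static double avgChainScaledM(long n) {
        return (double) n / (2L * n);
    }
}
```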
Or maybe you're looking at the total effort to insert all N
items into an initially empty table. Then under the first model
the effort is proportional to 0/M + 1/M + ... + (N-1)/M, which
is N*(N-1)/(2*M) or O(N*N). Under the second model, the effort
is 0.5 + 0.5 + ... + 0.5 = N/2 which is O(N).
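The two totals follow from the arithmetic-series closed form; a sketch, with names of my own choosing:

```java
public class InsertCost {
    // Model 1 (fixed M): expected comparisons to insert N items into
    // an initially empty table is 0/M + 1/M + ... + (N-1)/M,
    // which sums to N*(N-1)/(2*M) -- O(N*N).
    static double totalFixedM(long n, long m) {
        return (double) n * (n - 1) / (2.0 * m);
    }

    // Model 2 (M ~ 2*N): each insert costs about 0.5 comparisons,
    // so N inserts cost N/2 -- O(N).
    static double totalScaledM(long n) {
        return n / 2.0;
    }
}
```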
Or maybe you're noting that each word is inserted only once
but looked up many times, in which case things get much more
complex and you need to start worrying about effects like "common"
words entering into the table pretty early in the game while the
table is fairly small, and "rare" words appearing later on when
the table has grown large. Also, searches for "common" words are
almost always successful, while those for "rare" words have more
chance of being unsuccessful. A thorny thicket, to be sure.
Or maybe you're also trying to account for the time spent in
computing the hash codes, in which case the lengths of the Strings
enter into the picture, and not just their quantity.
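The length dependence is easy to see in the polynomial hash that java.lang.String documents. This sketch mirrors that documented formula (the method name is mine):

```java
public class PolyHash {
    // Equivalent to the documented String.hashCode() formula:
    //   s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    // One pass over the characters, so computing a hash code costs
    // time proportional to the String's length, not just O(1).
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }
}
```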
Big-Oh exists for the purpose of sweeping unimportant details
under the rug, deliberately discarding precision. But you can only
use Big-Oh well if you're precise about what you're approximating,
about what "N" means, about what things you model as constants and
what you consider variable, and so on. You can only throw away
precision if you're precise about throwing it away!