Hash Code Compression

Discussion in 'Java' started by j1mb0jay, Jan 11, 2008.

  1. j1mb0jay

    j1mb0jay Guest

    I am currently working on a dictionary-populating program. I currently
    have a socket connection to my local news server and am trawling through
    all of the articles looking for new words. Java's String class has a
    method that hashes strings. I was wondering if I should still be using
    it even though I have over two million words in the hash table,
    although the hash table is currently Big O(4).

    I am using the Multiply Add and Divide (MAD) method for the compression
    of the hash code. Does Java have any built-in methods that will do this
    for me, or does anyone know of a more efficient way?
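
    [Editor's note: a rough sketch of the MAD scheme mentioned above, i.e.
    index = ((a*k + b) mod p) mod n, with p a prime larger than the hash
    range and a, b fixed values in [1, p-1]. The constants P, A, and B here
    are illustrative choices, not anything from the poster's actual code.]

```java
final class MadCompressor {
    // Illustrative constants: P is a prime just above 2^32; A and B are
    // arbitrary fixed values in [1, P-1]. A is kept small enough that
    // A * k cannot overflow a 64-bit long.
    private static final long P = 4_294_967_311L;
    private static final long A = 1_000_003L;
    private static final long B = 12_345L;

    /** Compresses a 32-bit hash code to a bucket index in [0, n). */
    static int compress(int hashCode, int n) {
        long k = hashCode & 0xFFFFFFFFL;   // view the 32-bit hash as unsigned
        return (int) (((A * k + B) % P) % n);
    }
}
```

    Because every intermediate value is non-negative, the result needs no
    absolute-value fix-up before being used as an array index.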

    j1mb0jay
     
    j1mb0jay, Jan 11, 2008
    #1

  2. j1mb0jay

    Eric Sosman Guest

    j1mb0jay wrote:
    > I am currently working on a dictionary populating program. I currently
    > have a socket connection my local news server and am trawling through
    > all of the articles looking for new words. Java's String class has a
    > method that hashes strings. I was wondering if i should still be using
    > these even though I have over two million words in the hash table.
    > Although the hash table is currently Big 0(4).


    This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
    by definition. What do you really mean?

    > I am using the Multiply Add and Divide (MAD) method for the compression
    > of the hash code, does Java have any built in functions(methods) that
    > will do this for me, or does anyone know of a more efficient way?


    The value delivered by hashCode -- for any class, not
    just for String -- is a Java int, 32 bits wide. How (and why)
    are you "compressing" this value?

    --
    Eric Sosman
     
    Eric Sosman, Jan 11, 2008
    #2

  3. j1mb0jay

    j1mb0jay Guest

    Eric Sosman wrote:
    > j1mb0jay wrote:
    >> I am currently working on a dictionary populating program. I currently
    >> have a socket connection my local news server and am trawling through
    >> all of the articles looking for new words. Java's String class has a
    >> method that hashes strings. I was wondering if i should still be using
    >> these even though I have over two million words in the hash table.
    >> Although the hash table is currently Big 0(4).

    >
    > This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
    > by definition. What do you really mean?
    >
    >> I am using the Multiply Add and Divide (MAD) method for the
    >> compression of the hash code, does Java have any built in
    >> functions(methods) that will do this for me, or does anyone know of a
    >> more efficient way?

    >
    > The value delivered by hashCode -- for any class, not
    > just for String -- is a Java int, 32 bits wide. How (and why)
    > are you "compressing" this value?
    >



    My hash table is made up of an array of n LinkedLists (where n is a
    prime number that is roughly double the number of words in the dictionary).

    I first use the String.hashCode() method on a given word. I then
    compress this number so that I can use it as an index into the array of
    LinkedLists, as this 32-bit number is often far too large. I then insert
    the word into the LinkedList array at the compressed-value index (the
    fact that the hash table is an array of LinkedLists is so that it
    handles collisions).

    After inserting all of the words into the dictionary, the largest
    LinkedList in the array has only four elements. I thought Big O(4) was
    the correct way of describing this.
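
    [Editor's note: the structure described above, an array of LinkedLists
    indexed by a compressed hash, can be sketched roughly as below. The
    class and method names are illustrative, not the poster's actual code,
    and the compression step is simplified to a sign-bit mask plus modulus.]

```java
import java.util.LinkedList;

/** Illustrative chained hash table: an array of buckets, each a LinkedList. */
class WordSet {
    private final LinkedList<String>[] buckets;

    @SuppressWarnings("unchecked")
    WordSet(int n) {                       // n: prime roughly double the word count
        buckets = new LinkedList[n];
        for (int i = 0; i < n; i++) buckets[i] = new LinkedList<>();
    }

    private int index(String word) {
        // Mask off the sign bit before the modulus so the index is never negative.
        return (word.hashCode() & 0x7fffffff) % buckets.length;
    }

    /** Adds the word if absent; returns true if it was new. */
    boolean add(String word) {
        LinkedList<String> bucket = buckets[index(word)];
        if (bucket.contains(word)) return false;   // only a few entries to scan
        bucket.add(word);
        return true;
    }

    boolean contains(String word) {
        return buckets[index(word)].contains(word);
    }
}
```

    With the table about twice the word count, each bucket stays short, so
    the duplicate check scans at most a handful of entries.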

    Would it help if I posted my classes here, or offered you a place to
    download the program?

    j1mb0jay
     
    j1mb0jay, Jan 11, 2008
    #3
  4. j1mb0jay

    Daniel Pitts Guest

    On Jan 11, 3:05 pm, j1mb0jay <> wrote:
    > Eric Sosman wrote:
    > > j1mb0jay wrote:
    > >> I am currently working on a dictionary populating program. I currently
    > >> have a socket connection my local news server and am trawling through
    > >> all of the articles looking for new words. Java's String class has a
    > >> method that hashes strings. I was wondering if i should still be using
    > >> these even though I have over two million words in the hash table.
    > >> Although the hash table is currently Big 0(4).

    >
    > > This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
    > > by definition. What do you really mean?

    >
    > >> I am using the Multiply Add and Divide (MAD) method for the
    > >> compression of the hash code, does Java have any built in
    > >> functions(methods) that will do this for me, or does anyone know of a
    > >> more efficient way?

    >
    > > The value delivered by hashCode -- for any class, not
    > > just for String -- is a Java int, 32 bits wide. How (and why)
    > > are you "compressing" this value?

    >
    > My hash table is made up of an array of n LinkedLists (where n is a
    > prime number that is roughly double the number of words in the dictionary).
    >
    > I firstly use the String.hashCode() method on a given word. I then
    > compress this number so that i can use it as a index into the array of
    > LinkedList; as this 32bit number is often far to large. I then insert
    > the word into the LinkedList array at the compressed value index(The
    > fact the hashTable is an array of LinkedLists is so that it handles
    > collisions)
    >
    > After inserting all of the words into the dictionary the largest
    > LinkedList in the array has only four elements. I thought Big O(4) was
    > the correct way of describing this.
    >
    > Would it help if i posted my classes on here, or offer you a place to
    > download the program.
    >
    > j1mb0jay


    Why aren't you using the existing HashMap class?

    If you want a compact representation of the words you come across,
    consider a prefix tree data structure instead.

    Just so you know, Big O measures the dominant term without
    multipliers. For instance, if your algorithm takes N*N + N + 4
    steps, then it is O(N*N). If it takes 4*N*N steps, it is still O(N*N).
     
    Daniel Pitts, Jan 11, 2008
    #4
  5. j1mb0jay wrote:
    > Eric Sosman wrote:
    >> j1mb0jay wrote:
    >>> I am currently working on a dictionary populating program. I
    >>> currently have a socket connection my local news server and am
    >>> trawling through all of the articles looking for new words. Java's
    >>> String class has a method that hashes strings. I was wondering if i
    >>> should still be using these even though I have over two million words
    >>> in the hash table. Although the hash table is currently Big 0(4).

    >>
    >> This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
    >> by definition. What do you really mean?
    >>
    >>> I am using the Multiply Add and Divide (MAD) method for the
    >>> compression of the hash code, does Java have any built in
    >>> functions(methods) that will do this for me, or does anyone know of a
    >>> more efficient way?

    >>
    >> The value delivered by hashCode -- for any class, not
    >> just for String -- is a Java int, 32 bits wide. How (and why)
    >> are you "compressing" this value?
    >>

    >
    >
    > My hash table is made up of an array of n LinkedLists (where n is a
    > prime number that is roughly double the number of words in the dictionary).
    >
    > I firstly use the String.hashCode() method on a given word. I then
    > compress this number so that i can use it as a index into the array of
    > LinkedList; as this 32bit number is often far to large. I then insert
    > the word into the LinkedList array at the compressed value index(The
    > fact the hashTable is an array of LinkedLists is so that it handles
    > collisions)
    >
    > After inserting all of the words into the dictionary the largest
    > LinkedList in the array has only four elements. I thought Big O(4) was
    > the correct way of describing this.
    >
    > Would it help if i posted my classes on here, or offer you a place to
    > download the program.


    This is very similar to the design of java.util.HashSet, except it
    already has methods for mapping from hashCode to bucket number that have
    been tested with Java String.

    Is there some particular reason for rolling your own rather than using
    the java.util class?

    Patricia
     
    Patricia Shanahan, Jan 11, 2008
    #5
  6. j1mb0jay

    j1mb0jay Guest

    Daniel Pitts wrote:
    > On Jan 11, 3:05 pm, j1mb0jay <> wrote:
    >> Eric Sosman wrote:
    >>> j1mb0jay wrote:
    >>>> I am currently working on a dictionary populating program. I currently
    >>>> have a socket connection my local news server and am trawling through
    >>>> all of the articles looking for new words. Java's String class has a
    >>>> method that hashes strings. I was wondering if i should still be using
    >>>> these even though I have over two million words in the hash table.
    >>>> Although the hash table is currently Big 0(4).
    >>> This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
    >>> by definition. What do you really mean?
    >>>> I am using the Multiply Add and Divide (MAD) method for the
    >>>> compression of the hash code, does Java have any built in
    >>>> functions(methods) that will do this for me, or does anyone know of a
    >>>> more efficient way?
    >>> The value delivered by hashCode -- for any class, not
    >>> just for String -- is a Java int, 32 bits wide. How (and why)
    >>> are you "compressing" this value?

    >> My hash table is made up of an array of n LinkedLists (where n is a
    >> prime number that is roughly double the number of words in the dictionary).
    >>
    >> I firstly use the String.hashCode() method on a given word. I then
    >> compress this number so that i can use it as a index into the array of
    >> LinkedList; as this 32bit number is often far to large. I then insert
    >> the word into the LinkedList array at the compressed value index(The
    >> fact the hashTable is an array of LinkedLists is so that it handles
    >> collisions)
    >>
    >> After inserting all of the words into the dictionary the largest
    >> LinkedList in the array has only four elements. I thought Big O(4) was
    >> the correct way of describing this.
    >>
    >> Would it help if i posted my classes on here, or offer you a place to
    >> download the program.
    >>
    >> j1mb0jay

    >
    > Why aren't you using the existing HashMap class?
    >
    > If you want a compact representation of the words you come across,
    > consider a prefix tree data structure instead.
    >
    > Just so you know, Big O measures the dominant term without
    > multipliers, For instance, if your algorithm takes N *n + N + 4
    > steps, then it is O(N*N). If it takes 4*n*n steps, it is still O(N*N)


    I have been asked to create my own data structures to aid my
    understanding of the course material for my degree module. (Check the
    article header.)

    Because I am currently building the dictionary file by trawling news
    articles, each word I pull from an article needs to be checked against
    the dictionary to see if we already have it (I don't want to store each
    word more than once). My current methodology means I only have to look
    at a maximum of 4 words (out of 2.5 million) to see if I already have a
    given word stored in memory. Does this still mean my retrieval method is
    O(N squared)?

    j1mb0jay
     
    j1mb0jay, Jan 11, 2008
    #6
  7. j1mb0jay wrote:
    > Daniel Pitts wrote:
    >> On Jan 11, 3:05 pm, j1mb0jay <> wrote:
    >>> Eric Sosman wrote:
    >>>> j1mb0jay wrote:
    >>>>> I am currently working on a dictionary populating program. I currently
    >>>>> have a socket connection my local news server and am trawling through
    >>>>> all of the articles looking for new words. Java's String class has a
    >>>>> method that hashes strings. I was wondering if i should still be using
    >>>>> these even though I have over two million words in the hash table.
    >>>>> Although the hash table is currently Big 0(4).
    >>>> This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
    >>>> by definition. What do you really mean?
    >>>>> I am using the Multiply Add and Divide (MAD) method for the
    >>>>> compression of the hash code, does Java have any built in
    >>>>> functions(methods) that will do this for me, or does anyone know of a
    >>>>> more efficient way?
    >>>> The value delivered by hashCode -- for any class, not
    >>>> just for String -- is a Java int, 32 bits wide. How (and why)
    >>>> are you "compressing" this value?
    >>> My hash table is made up of an array of n LinkedLists (where n is a
    >>> prime number that is roughly double the number of words in the
    >>> dictionary).
    >>>
    >>> I firstly use the String.hashCode() method on a given word. I then
    >>> compress this number so that i can use it as a index into the array of
    >>> LinkedList; as this 32bit number is often far to large. I then insert
    >>> the word into the LinkedList array at the compressed value index(The
    >>> fact the hashTable is an array of LinkedLists is so that it handles
    >>> collisions)
    >>>
    >>> After inserting all of the words into the dictionary the largest
    >>> LinkedList in the array has only four elements. I thought Big O(4) was
    >>> the correct way of describing this.
    >>>
    >>> Would it help if i posted my classes on here, or offer you a place to
    >>> download the program.
    >>>
    >>> j1mb0jay

    >>
    >> Why aren't you using the existing HashMap class?
    >>
    >> If you want a compact representation of the words you come across,
    >> consider a prefix tree data structure instead.
    >>
    >> Just so you know, Big O measures the dominant term without
    >> multipliers, For instance, if your algorithm takes N *n + N + 4
    >> steps, then it is O(N*N). If it takes 4*n*n steps, it is still O(N*N)

    >
    > I have been asked to create my own data structures to help aid
    > understanding for the course material for my degree module.(Check
    > article header)


    That is indeed a good reason to avoid using the standard classes.

    Perhaps you should try a few different hash-code-to-bucket-number
    mappings and compare performance. In some situations I have found that
    a really simple, quickly calculated mapping, such as reduction modulo a
    power of two, had about the same collision rate as more complicated,
    slower-to-compute functions.
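
    [Editor's note: a minimal sketch of the power-of-two reduction
    mentioned above; the table size is an assumption for illustration.]

```java
final class PowerOfTwoIndex {
    /** Maps a hash to [0, size), assuming size is a power of two. */
    static int indexFor(int hash, int size) {
        // size - 1 is an all-ones bit mask, so this single AND is
        // equivalent to Math.floorMod(hash, size) for power-of-two sizes,
        // and it can never produce a negative index.
        return hash & (size - 1);
    }
}
```

    One caveat worth hedging: masking keeps only the low bits, so if the
    hash function clusters in its low bits the collision rate suffers;
    java.util.HashMap scrambles the hash with a supplemental step before
    masking for exactly this reason.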


    >
    > Because i am currently building the dictionary file by trawling news
    > articles each word I pull from an article needs to be checked in the
    > dictionary to see if we already have it(I don't want to store each word
    > more than once). My current methodology means I only have to look at a
    > maximum of 4 words(out of 2.5 million) to see if I already have this
    > word stored in memory. Does this still mean my retrieval method is Big(N
    > Squared)



    Note that the java.util hash-based classes do offer ways of controlling
    the number of buckets.

    Big-O is about limits as the problem size tends to infinity. If you only
    have to look at 4 words regardless of the number of words, then your
    lookup performance would be O(1) (Equivalent to O(42), O(1e100) etc. but
    O(1) is more conventional). If the number of words you have to examine
    depends on an upper bound on the number of words you process, you would
    need to examine the effect of increasing the number of words to get a
    computational complexity.

    Patricia
     
    Patricia Shanahan, Jan 11, 2008
    #7
  8. j1mb0jay

    j1mb0jay Guest

    Patricia Shanahan wrote:
    > j1mb0jay wrote:
    >> Eric Sosman wrote:
    >>> j1mb0jay wrote:
    >>>> I am currently working on a dictionary populating program. I
    >>>> currently have a socket connection my local news server and am
    >>>> trawling through all of the articles looking for new words. Java's
    >>>> String class has a method that hashes strings. I was wondering if i
    >>>> should still be using these even though I have over two million
    >>>> words in the hash table. Although the hash table is currently Big 0(4).
    >>>
    >>> This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
    >>> by definition. What do you really mean?
    >>>
    >>>> I am using the Multiply Add and Divide (MAD) method for the
    >>>> compression of the hash code, does Java have any built in
    >>>> functions(methods) that will do this for me, or does anyone know of
    >>>> a more efficient way?
    >>>
    >>> The value delivered by hashCode -- for any class, not
    >>> just for String -- is a Java int, 32 bits wide. How (and why)
    >>> are you "compressing" this value?
    >>>

    >>
    >>
    >> My hash table is made up of an array of n LinkedLists (where n is a
    >> prime number that is roughly double the number of words in the
    >> dictionary).
    >>
    >> I firstly use the String.hashCode() method on a given word. I then
    >> compress this number so that i can use it as a index into the array of
    >> LinkedList; as this 32bit number is often far to large. I then insert
    >> the word into the LinkedList array at the compressed value index(The
    >> fact the hashTable is an array of LinkedLists is so that it handles
    >> collisions)
    >>
    >> After inserting all of the words into the dictionary the largest
    >> LinkedList in the array has only four elements. I thought Big O(4) was
    >> the correct way of describing this.
    >>
    >> Would it help if i posted my classes on here, or offer you a place to
    >> download the program.

    >
    > This is very similar to the design of java.util.HashSet, except it
    > already has methods for mapping from hashCode to bucket number that have
    > been tested with Java String.
    >
    > Is there some particular reason for rolling your own rather than using
    > the java.util class?
    >
    > Patricia


    I have to roll my own as coursework for my data structures module.

    j1mb0jay
     
    j1mb0jay, Jan 11, 2008
    #8
  9. j1mb0jay

    Daniel Pitts Guest

    On Jan 11, 3:20 pm, j1mb0jay <> wrote:
    > Daniel Pitts wrote:
    > > On Jan 11, 3:05 pm, j1mb0jay <> wrote:
    > >> Eric Sosman wrote:
    > >>> j1mb0jay wrote:
    > >>>> I am currently working on a dictionary populating program. I currently
    > >>>> have a socket connection my local news server and am trawling through
    > >>>> all of the articles looking for new words. Java's String class has a
    > >>>> method that hashes strings. I was wondering if i should still be using
    > >>>> these even though I have over two million words in the hash table.
    > >>>> Although the hash table is currently Big 0(4).
    > >>> This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
    > >>> by definition. What do you really mean?
    > >>>> I am using the Multiply Add and Divide (MAD) method for the
    > >>>> compression of the hash code, does Java have any built in
    > >>>> functions(methods) that will do this for me, or does anyone know of a
    > >>>> more efficient way?
    > >>> The value delivered by hashCode -- for any class, not
    > >>> just for String -- is a Java int, 32 bits wide. How (and why)
    > >>> are you "compressing" this value?
    > >> My hash table is made up of an array of n LinkedLists (where n is a
    > >> prime number that is roughly double the number of words in the dictionary).

    >
    > >> I firstly use the String.hashCode() method on a given word. I then
    > >> compress this number so that i can use it as a index into the array of
    > >> LinkedList; as this 32bit number is often far to large. I then insert
    > >> the word into the LinkedList array at the compressed value index(The
    > >> fact the hashTable is an array of LinkedLists is so that it handles
    > >> collisions)

    >
    > >> After inserting all of the words into the dictionary the largest
    > >> LinkedList in the array has only four elements. I thought Big O(4) was
    > >> the correct way of describing this.

    >
    > >> Would it help if i posted my classes on here, or offer you a place to
    > >> download the program.

    >
    > >> j1mb0jay

    >
    > > Why aren't you using the existing HashMap class?

    >
    > > If you want a compact representation of the words you come across,
    > > consider a prefix tree data structure instead.

    >
    > > Just so you know, Big O measures the dominant term without
    > > multipliers, For instance, if your algorithm takes N *n + N + 4
    > > steps, then it is O(N*N). If it takes 4*n*n steps, it is still O(N*N)

    >
    > I have been asked to create my own data structures to help aid
    > understanding for the course material for my degree module.(Check
    > article header)
    >
    > Because i am currently building the dictionary file by trawling news
    > articles each word I pull from an article needs to be checked in the
    > dictionary to see if we already have it(I don't want to store each word
    > more than once). My current methodology means I only have to look at a
    > maximum of 4 words(out of 2.5 million) to see if I already have this
    > word stored in memory. Does this still mean my retrieval method is Big(N
    > Squared)
    >
    > j1mb0jay


    Actually, it means it is O(1) (as hash tables tend to be).

    Don't use the hash for the index in the linked list. Don't even
    BOTHER with indexes in the linked list. The hash should be an index
    into an array of lists (of whatever sort, linked or otherwise).

    Then each list should be relatively small, so trivial to search/
    insert.
     
    Daniel Pitts, Jan 11, 2008
    #9
  10. j1mb0jay

    Roedy Green Guest

    On Fri, 11 Jan 2008 22:11:20 +0000, j1mb0jay <> wrote,
    quoted or indirectly quoted someone who said :

    >I have over two million words in the hash table


    That is not very much compared with the size of the hashCode. However,
    if you needed a bigger hashCode for some reason, there are several
    popular digest algorithms that will give you various sized results.
    See http://mindprod.com/jgloss/digest.html

    The traditional way to prune a hash code to size without losing
    "randomness" is to take the absolute value of the modulus.

    see http://mindprod.com/jgloss/hashcode.html
    http://mindprod.com/jgloss/hashtable.html
    http://mindprod.com/jgloss/hashmap.html
    for a discussion.
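
    [Editor's note: a sketch of that pruning step, with one caveat worth
    flagging: in Java, Math.abs(Integer.MIN_VALUE) is itself negative due
    to two's-complement overflow, so code that takes the absolute value
    needs to handle that input, or mask the sign bit instead. The names
    here are illustrative.]

```java
final class HashPrune {
    /** "Absolute value of the modulus" pruning: |hash| mod n, into [0, n). */
    static int pruneWithAbs(int hash, int n) {
        // Math.abs(Integer.MIN_VALUE) overflows back to a negative value,
        // so that one input is special-cased.
        int h = (hash == Integer.MIN_VALUE) ? 0 : Math.abs(hash);
        return h % n;
    }

    /** Same intent without the edge case: clear the sign bit, then reduce. */
    static int pruneWithMask(int hash, int n) {
        return (hash & 0x7fffffff) % n;
    }
}
```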
    --
    Roedy Green, Canadian Mind Products
    The Java Glossary, http://mindprod.com
     
    Roedy Green, Jan 11, 2008
    #10
  11. j1mb0jay

    j1mb0jay Guest

    Daniel Pitts wrote:
    > On Jan 11, 3:20 pm, j1mb0jay <> wrote:
    >> Daniel Pitts wrote:
    >>> On Jan 11, 3:05 pm, j1mb0jay <> wrote:
    >>>> Eric Sosman wrote:
    >>>>> j1mb0jay wrote:
    >>>>>> I am currently working on a dictionary populating program. I currently
    >>>>>> have a socket connection my local news server and am trawling through
    >>>>>> all of the articles looking for new words. Java's String class has a
    >>>>>> method that hashes strings. I was wondering if i should still be using
    >>>>>> these even though I have over two million words in the hash table.
    >>>>>> Although the hash table is currently Big 0(4).
    >>>>> This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
    >>>>> by definition. What do you really mean?
    >>>>>> I am using the Multiply Add and Divide (MAD) method for the
    >>>>>> compression of the hash code, does Java have any built in
    >>>>>> functions(methods) that will do this for me, or does anyone know of a
    >>>>>> more efficient way?
    >>>>> The value delivered by hashCode -- for any class, not
    >>>>> just for String -- is a Java int, 32 bits wide. How (and why)
    >>>>> are you "compressing" this value?
    >>>> My hash table is made up of an array of n LinkedLists (where n is a
    >>>> prime number that is roughly double the number of words in the dictionary).
    >>>> I firstly use the String.hashCode() method on a given word. I then
    >>>> compress this number so that i can use it as a index into the array of
    >>>> LinkedList; as this 32bit number is often far to large. I then insert
    >>>> the word into the LinkedList array at the compressed value index(The
    >>>> fact the hashTable is an array of LinkedLists is so that it handles
    >>>> collisions)
    >>>> After inserting all of the words into the dictionary the largest
    >>>> LinkedList in the array has only four elements. I thought Big O(4) was
    >>>> the correct way of describing this.
    >>>> Would it help if i posted my classes on here, or offer you a place to
    >>>> download the program.
    >>>> j1mb0jay
    >>> Why aren't you using the existing HashMap class?
    >>> If you want a compact representation of the words you come across,
    >>> consider a prefix tree data structure instead.
    >>> Just so you know, Big O measures the dominant term without
    >>> multipliers, For instance, if your algorithm takes N *n + N + 4
    >>> steps, then it is O(N*N). If it takes 4*n*n steps, it is still O(N*N)

    >> I have been asked to create my own data structures to help aid
    >> understanding for the course material for my degree module.(Check
    >> article header)
    >>
    >> Because i am currently building the dictionary file by trawling news
    >> articles each word I pull from an article needs to be checked in the
    >> dictionary to see if we already have it(I don't want to store each word
    >> more than once). My current methodology means I only have to look at a
    >> maximum of 4 words(out of 2.5 million) to see if I already have this
    >> word stored in memory. Does this still mean my retrieval method is Big(N
    >> Squared)
    >>
    >> j1mb0jay

    >
    > Actually, it means it has Big O(1) (As hash tables tend to)
    >
    > Don't use the hash for the index in the linked list. Don't even
    > BOTHER with indexes in the linked list. The hash should be an index
    > into an array of lists (of whatever sort, linked or otherwise).
    >
    > Then each list should be relatively small, so trivial to search/
    > insert.
    >


    My lists are currently smallish; I was just hoping someone would know a
    better hash code function or a better compression method that would
    make each linked list hold 1 element rather than up to 4 elements. That
    would mean each linked list could be replaced with a simple
    KeyValuePair.

    Perfection means better marks :D

    j1mb0jay
     
    j1mb0jay, Jan 11, 2008
    #11
  12. j1mb0jay

    j1mb0jay Guest

    Roedy Green wrote:
    > On Fri, 11 Jan 2008 22:11:20 +0000, j1mb0jay <> wrote,
    > quoted or indirectly quoted someone who said :
    >
    >> I have over two million words in the hash table

    >
    > That is not very much compared with the size of the hashCode. However,
    > if you needed a bigger hashCode for some reason, there are several
    > popular digest algorithms that will give you various sized results.
    > See http://mindprod.com/jgloss/digest.html
    >
    > The traditional way to prune a hash code to size without losing
    > "randomness" is to take the absolute value of the modulus.
    >
    > see http://mindprod.com/jgloss/hashcode.html
    > http://mindprod.com/jgloss/hashtable.html
    > http://mindprod.com/jgloss/hashmap.html
    > for a discussion.


    Thank you. I am looking to prune the hash code rather than make it
    larger, and will take a look at mindprod.

    Thanks again

    j1mb0jay
     
    j1mb0jay, Jan 11, 2008
    #12
  13. j1mb0jay wrote:
    > Daniel Pitts wrote:
    >> On Jan 11, 3:20 pm, j1mb0jay <> wrote:
    >>> Daniel Pitts wrote:
    >>>> On Jan 11, 3:05 pm, j1mb0jay <> wrote:
    >>>>> Eric Sosman wrote:
    >>>>>> j1mb0jay wrote:
    >>>>>>> I am currently working on a dictionary populating program. I
    >>>>>>> currently
    >>>>>>> have a socket connection my local news server and am trawling
    >>>>>>> through
    >>>>>>> all of the articles looking for new words. Java's String class has a
    >>>>>>> method that hashes strings. I was wondering if i should still be
    >>>>>>> using
    >>>>>>> these even though I have over two million words in the hash table.
    >>>>>>> Although the hash table is currently Big 0(4).
    >>>>>> This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
    >>>>>> by definition. What do you really mean?
    >>>>>>> I am using the Multiply Add and Divide (MAD) method for the
    >>>>>>> compression of the hash code, does Java have any built in
    >>>>>>> functions(methods) that will do this for me, or does anyone know
    >>>>>>> of a
    >>>>>>> more efficient way?
    >>>>>> The value delivered by hashCode -- for any class, not
    >>>>>> just for String -- is a Java int, 32 bits wide. How (and why)
    >>>>>> are you "compressing" this value?
    >>>>> My hash table is made up of an array of n LinkedLists (where n is a
    >>>>> prime number that is roughly double the number of words in the
    >>>>> dictionary).
    >>>>> I firstly use the String.hashCode() method on a given word. I then
    >>>>> compress this number so that i can use it as a index into the array of
    >>>>> LinkedList; as this 32bit number is often far to large. I then insert
    >>>>> the word into the LinkedList array at the compressed value index(The
    >>>>> fact the hashTable is an array of LinkedLists is so that it handles
    >>>>> collisions)
    >>>>> After inserting all of the words into the dictionary the largest
    >>>>> LinkedList in the array has only four elements. I thought Big O(4) was
    >>>>> the correct way of describing this.
    >>>>> Would it help if i posted my classes on here, or offer you a place to
    >>>>> download the program.
    >>>>> j1mb0jay
    >>>> Why aren't you using the existing HashMap class?
    >>>> If you want a compact representation of the words you come across,
    >>>> consider a prefix tree data structure instead.
    >>>> Just so you know, Big O measures the dominant term without
    >>>> multipliers, For instance, if your algorithm takes N *n + N + 4
    >>>> steps, then it is O(N*N). If it takes 4*n*n steps, it is still O(N*N)
    >>> I have been asked to create my own data structures to help aid
    >>> understanding for the course material for my degree module.(Check
    >>> article header)
    >>>
    >>> Because i am currently building the dictionary file by trawling news
    >>> articles each word I pull from an article needs to be checked in the
    >>> dictionary to see if we already have it(I don't want to store each word
    >>> more than once). My current methodology means I only have to look at a
    >>> maximum of 4 words(out of 2.5 million) to see if I already have this
    >>> word stored in memory. Does this still mean my retrieval method is Big(N
    >>> Squared)
    >>>
    >>> j1mb0jay

    >>
    >> Actually, it means it has Big O(1) (As hash tables tend to)
    >>
    >> Don't use the hash for the index in the linked list. Don't even
    >> BOTHER with indexes in the linked list. The hash should be an index
    >> into an array of lists (of whatever sort, linked or otherwise).
    >>
    >> Then each list should be relatively small, so trivial to search/
    >> insert.
    >>

    >
    > My lists are currently smallish, I was just hoping someone would know a
    > better hash code function or a better compression method which would
    > make each linked list have 1 element rather than up to 4 elements. So
    > that would mean each linked list could be replaced with a simple
    > KeyValuePair.
    >
    > Perfection means better marks :D


    In that case, do a search for "perfect hash". :)

    If you have a fixed list of words, it may be possible to construct a
    hash function such that every member of the list has a different code. I
    don't know whether it is practical for 2.5 million words - perfect
    hashing is usually used for small sets of keywords.

    Patricia
     
    Patricia Shanahan, Jan 11, 2008
    #13
  14. j1mb0jay

    j1mb0jay Guest

    Patricia Shanahan wrote:
    > j1mb0jay wrote:
    >> Daniel Pitts wrote:
    >>> On Jan 11, 3:20 pm, j1mb0jay <> wrote:
    >>>> Daniel Pitts wrote:
    >>>>> On Jan 11, 3:05 pm, j1mb0jay <> wrote:
    >>>>>> Eric Sosman wrote:
    >>>>>>> j1mb0jay wrote:
    >>>>>>>> I am currently working on a dictionary populating program. I
    >>>>>>>> currently
    >>>>>>>> have a socket connection my local news server and am trawling
    >>>>>>>> through
    >>>>>>>> all of the articles looking for new words. Java's String class
    >>>>>>>> has a
    >>>>>>>> method that hashes strings. I was wondering if i should still be
    >>>>>>>> using
    >>>>>>>> these even though I have over two million words in the hash table.
    >>>>>>>> Although the hash table is currently Big 0(4).
    >>>>>>> This makes no sense. O(4) = O(1) = O(0.01) = O(1000000),
    >>>>>>> by definition. What do you really mean?
    >>>>>>>> I am using the Multiply Add and Divide (MAD) method for the
    >>>>>>>> compression of the hash code, does Java have any built in
    >>>>>>>> functions(methods) that will do this for me, or does anyone know
    >>>>>>>> of a
    >>>>>>>> more efficient way?
    >>>>>>> The value delivered by hashCode -- for any class, not
    >>>>>>> just for String -- is a Java int, 32 bits wide. How (and why)
    >>>>>>> are you "compressing" this value?
    >>>>>> My hash table is made up of an array of n LinkedLists (where n is a
    >>>>>> prime number that is roughly double the number of words in the
    >>>>>> dictionary).
    >>>>>> I firstly use the String.hashCode() method on a given word. I then
    >>>>>> compress this number so that I can use it as an index into the
    >>>>>> array of
    >>>>>> LinkedLists, as this 32-bit number is often far too large. I then insert
    >>>>>> the word into the LinkedList array at the compressed-value index (the
    >>>>>> fact the hash table is an array of LinkedLists is so that it handles
    >>>>>> collisions).
    >>>>>> After inserting all of the words into the dictionary the largest
    >>>>>> LinkedList in the array has only four elements. I thought Big O(4)
    >>>>>> was
    >>>>>> the correct way of describing this.
    >>>>>> Would it help if i posted my classes on here, or offer you a place to
    >>>>>> download the program.
    >>>>>> j1mb0jay
    >>>>> Why aren't you using the existing HashMap class?
    >>>>> If you want a compact representation of the words you come across,
    >>>>> consider a prefix tree data structure instead.
    >>>>> Just so you know, Big O measures the dominant term without
    >>>>> multipliers. For instance, if your algorithm takes N*N + N + 4
    >>>>> steps, then it is O(N*N). If it takes 4*N*N steps, it is still O(N*N).
    >>>> I have been asked to create my own data structures to help aid
    >>>> understanding for the course material for my degree module.(Check
    >>>> article header)
    >>>>
    >>>> Because i am currently building the dictionary file by trawling news
    >>>> articles each word I pull from an article needs to be checked in the
    >>>> dictionary to see if we already have it(I don't want to store each word
    >>>> more than once). My current methodology means I only have to look at a
    >>>> maximum of 4 words(out of 2.5 million) to see if I already have this
    >>>> word stored in memory. Does this still mean my retrieval method is
    >>>> Big(N
    >>>> Squared)
    >>>>
    >>>> j1mb0jay
    >>>
    >>> Actually, it means it has Big O(1) (As hash tables tend to)
    >>>
    >>> Don't use the hash for the index in the linked list. Don't even
    >>> BOTHER with indexes in the linked list. The hash should be an index
    >>> into an array of lists (of whatever sort, linked or otherwise).
    >>>
    >>> Then each list should be relatively small, so trivial to search/
    >>> insert.
    >>>

    >>
    >> My lists are currently smallish, I was just hoping someone would know
    >> a better hash code function or a better compression method which would
    >> make each linked list have 1 element rather than up to 4 elements. So
    >> that would mean each linked list could be replaced with a simple
    >> KeyValuePair.
    >>
    >> Perfection means better marks :D

    >
    > In that case, do a search for "perfect hash". :)
    >
    > If you have a fixed list of words, it may be possible to construct a
    > hash function such that every member of the list has a different code. I
    > don't know whether it is practical for 2.5 million words - perfect
    > hashing is usually used for small sets of keywords.
    >
    > Patricia
    >


    I have two main ways of using the dictionary. The first is inserting
    words, for which I use the hash table; luckily I have access to servers
    with lots of RAM at university, which is why the hash table is double
    the size of the word list. The second is when I am trying to break
    different encryption types, for which I load the words into a binary
    sorted tree where the comparable data is the popularity of each word
    during the last trawl of the newsgroups.
     
    j1mb0jay, Jan 12, 2008
    #14
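    [For readers unfamiliar with the MAD (multiply-add-and-divide)
    compression mentioned in the original post, it can be sketched as
    below. The constants A, B and the prime P are illustrative values,
    not ones from j1mb0jay's code; any prime P larger than 2^31 works,
    with A not a multiple of P.]

    ```java
    // Sketch of MAD compression of a 32-bit hash code into a bucket
    // index: h -> ((A*h + B) mod P) mod capacity. Constants are
    // illustrative, chosen at random from [1, P-1] and [0, P-1].
    public class MadCompressor {
        private static final long P = 2_147_483_659L; // prime > 2^31
        private static final long A = 1_327_217_885L;
        private static final long B = 613_651_349L;

        /** Maps a hashCode to an index in [0, capacity). */
        public static int compress(int hashCode, int capacity) {
            long h = hashCode & 0xFFFFFFFFL;           // treat as unsigned
            return (int) (((A * h + B) % P) % capacity);
        }

        public static void main(String[] args) {
            // Index is non-negative even for negative hash codes.
            System.out.println(compress("example".hashCode(), 100_003));
        }
    }
    ```

    [The intermediate product A*h stays below 2^63, so plain long
    arithmetic suffices and no BigInteger is needed.]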
  15. j1mb0jay

    Lew Guest

    j1mb0jay wrote:
    > Thank you, I am looking to prune the hash code rather than make it
    > larger and will take a look at mindprod.


    Patricia's suggestion of taking hashCode = value % k, where k is the current
    size of your hash table, is probably the best.

    --
    Lew
     
    Lew, Jan 12, 2008
    #15
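    [One caveat worth noting alongside Lew's suggestion: String.hashCode()
    can return a negative int, and Java's % operator then produces a
    negative remainder, which would index out of bounds. A common fix is
    to mask off the sign bit first; this helper is an illustrative
    sketch, not code from the thread.]

    ```java
    // Illustrative bucket-index computation. Java's % can yield a
    // negative result for a negative hash code, so mask the sign bit
    // before taking the remainder.
    public class BucketIndex {
        public static int indexFor(Object key, int tableSize) {
            return (key.hashCode() & 0x7FFFFFFF) % tableSize;
        }

        public static void main(String[] args) {
            int h = -123456789;                    // a negative hash code
            System.out.println(h % 101);           // negative remainder
            System.out.println((h & 0x7FFFFFFF) % 101); // valid index
        }
    }
    ```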
  16. j1mb0jay

    j1mb0jay Guest

    Lew wrote:
    > j1mb0jay wrote:
    >> Thank you, I am looking to prune the hash code rather than make it
    >> larger and will take a look at mindprod.

    >
    > Patricia's suggestion of taking hashCode = value % k, where k is the
    > current size of your hash table, is probably the best.
    >


    Am I correct in saying this will work best if the current size of the
    hash table is a prime number?

    j1mb0jay
     
    j1mb0jay, Jan 12, 2008
    #16
  17. j1mb0jay wrote:
    > Lew wrote:
    >> j1mb0jay wrote:
    >>> Thank you, I am looking to prune the hash code rather than make it
    >>> larger and will take a look at mindprod.

    >>
    >> Patricia's suggestion of taking hashCode = value % k, where k is the
    >> current size of your hash table, is probably the best.
    >>

    >
    > Am I correct in saying this will work best if the current size of the
    > hash table is a prime number?


    That seems like a good subject for experiments. It is far from obvious,
    especially given the design of the String hashCode method. I would,
    however, avoid 31.

    Patricia
     
    Patricia Shanahan, Jan 12, 2008
    #17
  18. j1mb0jay

    Charles Guest

    On Jan 11, 5:11 pm, j1mb0jay <> wrote:
    > I am currently working on a dictionary populating program. I currently
    > have a socket connection my local news server and am trawling through
    > all of the articles looking for new words. Java's String class has a
    > method that hashes strings. I was wondering if i should still be using
    > these even though I have over two million words in the hash table.
    > Although the hash table is currently Big 0(4).
    >
    > I am using the Multiply Add and Divide (MAD) method for the compression
    > of the hash code, does Java have any built in functions(methods) that
    > will do this for me, or does anyone know of a more efficient way?
    >
    > j1mb0jay


    Hi Group:

    I don't know much about the MAD method, but LZW is another useful
    compression technique. I consider compression akin to folding spaces.
    I also think of greatest common divisor algorithms.

    I am thinking of it, but I am not showing how to use it. Sorry.

    If you have time implement two solutions and continuously improve.

    You could also look up compression patents.

    :)

    Good luck.
     
    Charles, Jan 12, 2008
    #18
  19. j1mb0jay

    Lew Guest

    Charles wrote:
    > I don't know much about the MAD method, but LZW is another useful
    > compression technique. I consider compression akin to folding spaces.
    > I also think of greatest common divisor algorithms.
    >
    > I am thinking of it, but I am not showing how to use it. Sorry.
    >
    > If you have time implement two solutions and continuously improve.


    Since the data to be "compressed" is exactly four bytes long, LZW and the like
    are way too heavy. Patricia's advice to use the remainder operator % is still
    the best.

    > You could also look up compression patents.


    There are at least two U.S. patents for compression algorithms that are not
    mathematically feasible. Be careful if you are searching patents.

    --
    Lew
     
    Lew, Jan 12, 2008
    #19
  20. j1mb0jay

    Roedy Green Guest

    On Fri, 11 Jan 2008 23:40:28 +0000, j1mb0jay <> wrote,
    quoted or indirectly quoted someone who said :

    >My lists are currently smallish, i was just hoping someone would know a
    >better hash Code function or a better compression method which would
    >make each linked list have 1 element rather than up to 4 elements. So
    >that would mean each linked list could be replaced with a simple
    >KeyValuePair.


    There is an algorithm to create an algorithm to get a perfect
    distribution IF you know the values in advance, called GPerf. There
    is a link to it at http://mindprod.com/jgloss/hashcode.html
    --
    Roedy Green, Canadian Mind Products
    The Java Glossary, http://mindprod.com
     
    Roedy Green, Jan 12, 2008
    #20
