optimised HashMap

Discussion in 'Java' started by Roedy Green, Nov 24, 2012.

  1. Roedy Green

    Roedy Green Guest

    Is there something like HashMap but that is optimised for when nearly
    always the thing you are looking up is not in the list, and when you
    can add the list of words to look up and then freeze it?

    I have to scan an entire website looking up every word.
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    Students who hire or con others to do their homework are as foolish
    as couch potatoes who hire others to go to the gym for them.
    Roedy Green, Nov 24, 2012
    #1
  2. Arne Vajhøj

    Arne Vajhøj Guest

    On 11/23/2012 8:12 PM, Roedy Green wrote:
    > Is there something like HashMap but that optimised when nearly always
    > the thing you are looking up is not in the list, and when you can add
    > the list of words to look up and then freeze it.
    >
    > I have to scan an entire website looking up every word.


    HashMap get is already O(1). There is very little
    room for optimization.

    Arne
    Arne Vajhøj, Nov 24, 2012
    #2
  3. markspace

    markspace Guest

    On 11/23/2012 5:12 PM, Roedy Green wrote:
    > Is there something like HashMap but that optimised when nearly always
    > the thing you are looking up is not in the list, and when you can add
    > the list of words to look up and then freeze it.



    I'm not sure what you are trying to say there. You want the case where
    you do not find something in a hash map to be optimized? "Optimized" how?

    What do you mean "add to the list of words" and "freeze"?

    >
    > I have to scan an entire website looking up every word.
    >


    Google for "search engine." Wikipedia has an entry for it.
    markspace, Nov 24, 2012
    #3
  4. Roedy Green

    Roedy Green Guest

    On Fri, 23 Nov 2012 17:33:43 -0800, markspace <-@.> wrote, quoted or
    indirectly quoted someone who said :

    >I'm not sure what you are trying to say there. You want the case where
    >you do not find something in a hash map to be optimized? "Optimized" how?


    >What do you mean "add to the list of words" and "freeze"?


    The following is not the real problem, but it might more simply
    illustrate what I am asking.

    Think of an ordinary HashMap<String,String>

    What it does is translate a few English words with French derivation,
    putting the French accents on them, e.g. naive -> na&iuml;ve, Napoleon
    -> Napol&eacute;on.

    Let us say you have 100 such words you want to transform. (In my
    actual problem I have about 1500 words).

    You go through the files for a website, looking up each word of text
    (avoiding HTML markup) in the HashMap. If you find it you replace it.

    Most of the time the word you look up is not in the list.

    This is a time-consuming process. I would like to speed it up.

    My lookup has two properties that might be exploited in some variant
    HashMap.

    1. nearly always the lookup fails. The code should be optimised for
    this case. If it has some fast way of knowing the elt is not there,
    it should do that first.

    2. the list of words to lookup does not change after initial
    preparation. I can afford to do some special calculation to prime the
    lookup. For example, I once heard of some tweaking to avoid long
    collision chains for a C implementation of HashMap.

    My question had two purposes. To see if there was something available
    off the shelf, and to stimulate thought on some new algorithm that
    could have wider application than just my problem.

    Another way of looking at the problem is it would be nice to have a
    HashSet implementation that was considerably faster than a HashMap.
    IIRC, currently HashSet is implemented as a HashMap.

    Such an algorithm could be used to fix your most common spelling
    mistakes, to add links to magic words, to add markup to magic words,
    to find and report the presence of certain words, or in my case to
    find acronyms and replace them with a macro for that acronym that
    displays the meaning of the acronym the first time it is used on a
    page.
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    Students who hire or con others to do their homework are as foolish
    as couch potatoes who hire others to go to the gym for them.
    Roedy Green, Nov 24, 2012
    #4
  5. Roedy Green wrote:
    > Is there something like HashMap but that optimised when nearly always
    > the thing you are looking up is not in the list, and when you can add
    > the list of words to look up and then freeze it.
    >
    > I have to scan an entire website looking up every word.


    Unless said web site is www.archive.org, just create a HashMap
    and use a Collections.unmodifiableMap() wrapper for lookup. I'd
    expect memory to become a problem way earlier than processor.
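
    A minimal sketch of that frozen-map setup (class and method names are
    mine, and the two sample entries are just Roedy's examples):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public final class AccentTable {
        // Built once at startup, then frozen; lookups go straight to the HashMap underneath.
        private static final Map<String, String> REPLACEMENTS;
        static {
            Map<String, String> m = new HashMap<String, String>(4096); // roomy for ~1500 words, keeps chains short
            m.put("naive", "na&iuml;ve");
            m.put("Napoleon", "Napol&eacute;on");
            // ... load the rest of the word list here ...
            REPLACEMENTS = Collections.unmodifiableMap(m);
        }

        static String lookup(String word) {
            return REPLACEMENTS.get(word); // null means "not one of ours" -- the common case
        }
    }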

    Depending on how often each String is looked up, and how long the
    Strings are, a data structure that does not need String.hashCode()
    might be faster.

    I'd consider a tree along the following lines:

    Characters not in [a-zA-Z] are expressed as `nnnn - the ASCII value of
    '`' is 'a' - 1. Characters in [A-Z] are lowercased. Nodes contain a
    boolean and a Node[27]. The tree contains a word if there is a path
    from the root to a Node whose boolean is true and that is reached via
    Node[String.charAt(l)] at level l.

    For example, a tree containing exactly
    "a", "at", "and", "no", "nil", "not"
    would be

    N0 = { false, [ null, N1, 12*null, N2, 12*null ] } // ""
    N1 = { true, [ null, 13*null, N3, 5*null, N4, 6*null ] } // "a"
    N2 = { false, [ null, 8*null, N5, 5*null, N6, 11*null ] } // "n"
    N3 = { false, [ null, 3*null, N7, 22*null ] } // "an"
    N4 = { true, [ null, 26*null ] } // "at"
    N5 = { false, [ null, 11*null, N8, 14*null ] } // "ni"
    N6 = { true, [ null, 19*null, N9, 6*null ] } // "no"
    N7 = { true, [ null, 26*null ] } // "and"
    N8 = { true, [ null, 26*null ] } // "nil"
    N9 = { true, [ null, 26*null ] } // "not"

    For a Map, replace the boolean with a field holding the value.

    Possible improvements:

    - Since we have three (32bit VM) or seven (64bit VM) padding bytes
    in each Node, we could use two of them to hold a short indicating
    the maximum length of Strings in the subtree.

    - Instantiate Node[] only if actually needed to hold a child, and only
    large enough to hold all currently known children. Use one of the
    padding bytes to hold an index offset, such that a Node with only
    a single child only holds a Node[1]:
    N4 = { true, 0, null }
    N6 = { true, 20, [ N9 ] }

    - Use an interface and specialized concrete classes for
    - leaf node (no fields at all, because boolean will always be true)
    - single child node { boolean, Node }
    or even "true" and "false" single child nodes
    - ...
    but this will considerably increase preprocessing cost.

    - If many of your words contain non-[a-z] characters, provide array
    positions for the most common of these, or use a linear probing
    hash table for the non-[a-z] children.

    - If the tree is very sparse, doing so even for the [a-z] children
    might be better in the long run because more of it fits into caches.

    This surely has been invented before and given a nice name...
    Apache's or Google's collection libraries might even feature
    high quality implementations.
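
    In case it helps, a rough, untested sketch of that node layout (index 0
    is the '`' slot, 1..26 are 'a'..'z'; the `nnnn escaping for other
    characters is left out here, everything non-alphabetic just falls into
    slot 0):

    final class TrieNode {
        boolean isWord;                              // or a value field for a Map
        final TrieNode[] children = new TrieNode[27];
    }

    final class WordTrie {
        private final TrieNode root = new TrieNode();

        void add(String word) {
            TrieNode n = root;
            for (int i = 0; i < word.length(); i++) {
                int slot = slot(word.charAt(i));
                if (n.children[slot] == null) n.children[slot] = new TrieNode();
                n = n.children[slot];
            }
            n.isWord = true;
        }

        boolean contains(String word) {
            TrieNode n = root;
            for (int i = 0; i < word.length(); i++) {
                n = n.children[slot(word.charAt(i))];
                if (n == null) return false;         // the common miss, often after one or two characters
            }
            return n.isWord;
        }

        private int slot(char c) {
            char lc = Character.toLowerCase(c);
            return (lc >= 'a' && lc <= 'z') ? lc - 'a' + 1 : 0;
        }
    }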

    --

    "I'm a doctor, not a mechanic." Dr Leonard McCoy <>
    "I'm a mechanic, not a doctor." Volker Borchert <>
    Volker Borchert, Nov 24, 2012
    #5
  6. Roedy Green

    Roedy Green Guest

    On Fri, 23 Nov 2012 22:42:40 -0800, Roedy Green
    <> wrote, quoted or indirectly quoted
    someone who said :

    >Another way of looking at the problem is it would be nice to have a
    >HashSet implementation that was considerably faster than a HashMap.
    >IIRC, currently HashSet is implemented as a HashMap.


    A 3-valued HashSet.contains would still be useful:
    no -- not in set
    yes -- in set
    maybe -- don't know

    At least you could eliminate the no's from further processing.
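
    That no/maybe split is what a Bloom filter gives you. A hand-rolled,
    untested sketch (the bit count and the second hash are made-up choices,
    not tuned for anything):

    import java.util.BitSet;

    // "false" = definitely not in the set; "true" = maybe, so only then
    // pay for the real HashMap/HashSet lookup.
    final class MaybeSet {
        private final BitSet bits;
        private final int size;

        MaybeSet(Iterable<String> words, int bitCount) {
            size = bitCount;
            bits = new BitSet(bitCount);
            for (String w : words) {
                bits.set(index1(w));
                bits.set(index2(w));
            }
        }

        boolean maybeContains(String w) {
            return bits.get(index1(w)) && bits.get(index2(w));
        }

        private int index1(String w) {
            return (w.hashCode() & 0x7fffffff) % size;
        }

        private int index2(String w) {
            // a second, cheaper hash: length plus first and last characters
            int h = w.length();
            if (h > 0) h = h * 31 + w.charAt(0) * 131 + w.charAt(w.length() - 1);
            return (h & 0x7fffffff) % size;
        }
    }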
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    Students who hire or con others to do their homework are as foolish
    as couch potatoes who hire others to go to the gym for them.
    Roedy Green, Nov 24, 2012
    #6
  7. Roedy Green

    Roedy Green Guest

    On Sat, 24 Nov 2012 10:21:14 -0000, "Chris Uppal"
    <-THIS.org> wrote, quoted or indirectly
    quoted someone who said :

    >Look into the literature on fast text searching (for instance bit-parallel
    >matching). It's not entirely clear to me what Roedy is trying to do, but it
    >sounds as if "bulk" matching/searching might be relevant.


    Yes, a Boyer-Moore to simultaneously search for the whole list of
    words, then when it has a hit, see if it is a word in isolation rather
    than a word fragment.
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    Students who hire or con others to do their homework are as foolish
    as couch potatoes who hire others to go to the gym for them.
    Roedy Green, Nov 24, 2012
    #7
  8. On 24.11.2012 12:39, Roedy Green wrote:
    > On Sat, 24 Nov 2012 10:21:14 -0000, "Chris Uppal"
    > <-THIS.org> wrote, quoted or indirectly
    > quoted someone who said :
    >
    >> Look into the literature on fast text searching (for instance bit-parallel
    >> matching). It's not entirely clear to me what Roedy is trying to do, but it
    >> sounds as if "bulk" matching/searching might be relevant.

    >
    > Yes a Boyer-Moore to simultaneously search for the whole list of
    > words, then when it has a hit see if it has word in isolation rather
    > than a word fragment.


    Here's another approach:

    1. fill a HashMap with the translations.
    2. Create a tree or trie from the keys.
    3. Convert the trie to a regular expression optimized for NFA automata
    (such as is used in Java std. library).
    4. Surround the regexp with additional regexp to ensure word matches and
    probably exclude matching inside HTML tags
    5. Scan the document with Matcher.find()

    The idea of item 3 is to create a regexp with as little backtracking as
    possible. For example, from

    foo
    foot
    fuss

    you make

    (?:f(?:oot?|uss))

    Not sure though whether it is dramatically faster or slower than a
    standard string search like Boyer-Moore - probably not.
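
    A sketch of steps 3 and 4 that skips the trie compression and just
    emits a quoted alternation, longest words first so "foot" is tried
    before "foo" (the HTML-tag exclusion is left out here):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Set;
    import java.util.regex.Pattern;

    final class PatternBuilder {
        static Pattern build(Set<String> words) {
            List<String> sorted = new ArrayList<String>(words);
            Collections.sort(sorted, new Comparator<String>() {   // longest alternatives first
                public int compare(String a, String b) { return b.length() - a.length(); }
            });
            StringBuilder sb = new StringBuilder("\\b(?:");
            for (int i = 0; i < sorted.size(); i++) {
                if (i > 0) sb.append('|');
                sb.append(Pattern.quote(sorted.get(i)));          // \Q...\E, so words need no escaping
            }
            return Pattern.compile(sb.append(")\\b").toString());
        }
    }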

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, Nov 24, 2012
    #8
  9. On 11/24/2012 3:34 AM, Roedy Green wrote:
    > On Fri, 23 Nov 2012 22:42:40 -0800, Roedy Green
    > <> wrote, quoted or indirectly quoted
    > someone who said :
    >
    >> Another way of looking at the problem is it would be nice to have a
    >> HashSet implementation that was considerably faster than a HashMap.
    >> IIRC, currently HashSet is implemented as a HashMap.

    >
    > A a 3- valued HashSet.contains would still be useful
    > no -- not in set
    > yes - in set
    > maybe -- don't know
    >
    > At least you could eliminate the no's from further processing.
    >


    What about a sorted map? I would think searching a sorted map would be
    much quicker than an unsorted one.

    --

    Knute Johnson
    Knute Johnson, Nov 24, 2012
    #9
  10. Arne Vajhøj

    Arne Vajhøj Guest

    On 11/23/2012 10:51 PM, Patricia Shanahan wrote:
    > On 11/23/2012 5:12 PM, Roedy Green wrote:
    >> Is there something like HashMap but that optimised when nearly always
    >> the thing you are looking up is not in the list, and when you can add
    >> the list of words to look up and then freeze it.
    >>
    >> I have to scan an entire website looking up every word.

    >
    > Look up "perfect hash". The idea is that, given a set of strings, you
    > can calculate a hash function that maps each of those strings to a
    > different bucket. That avoids any chain searching due to bucket
    > collisions, and simplifies the data structures.


    I would expect gains from using a better hash function
    compared to the standard Java one to be very small for
    Strings containing words (English and Java code).

    Arne
    Arne Vajhøj, Nov 24, 2012
    #10
  11. Arne Vajhøj

    Arne Vajhøj Guest

    On 11/24/2012 1:42 AM, Roedy Green wrote:
    > On Fri, 23 Nov 2012 17:33:43 -0800, markspace <-@.> wrote, quoted or
    > indirectly quoted someone who said :
    >
    >> I'm not sure what you are trying to say there. You want the case where
    >> you do not find something in a hash map to be optimized? "Optimized" how?

    >
    >> What do you mean "add to the list of words" and "freeze"?

    >
    > The following is not the real problem, but it might more simply
    > illustrate what I am asking.
    >
    > Think of an ordinary HashMap<String,String>
    >
    > What it does is translate a few English words with French derivation,
    > putting the French accents on them. e.g. naive -> na&iuml;ve Napoleon
    > -> Napol&acute;on
    >
    > Let us say you have 100 such words you want to transform. (In my
    > actual problem I have about 1500 words).
    >
    > You go through the files for a website looking at each word of text
    > (avoiding HTML markup) in the HashMap. If you find it you replace it.
    >
    > Most of the time word you look up is not in the list.
    >
    > This is a time-consuming process. I would like to speed it up.


    Reading the HTML files, splitting them up into words and discarding
    HTML seems more likely to be the bottleneck than the lookups in a
    HashMap.

    What do your measurements show for CPU time and wall time
    distributed between the different activities?

    You did measure, right?
    > My lookup has two properties that might be exploited in some variant
    > HashMap.
    >
    > 1. nearly always the lookup fails. The code should be optimised for
    > this case. If it has some fast way of knowing the elt is not there,
    > it should do that first.


    HashMap should already do that.

    > 2. the list of words to lookup does not change after initial
    > preparation. I can afford to do some special calculation to prime the
    > lookup. For example, I once heard of some tweaking to avoid long
    > collision chains for a C implementation of HashMap.


    If you have long collision chains then you can obviously improve
    by choosing a better hash function.

    But do you have long collision chains?

    Java String hash is not that bad.

    You can easily test the number of collisions.
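
    Something like this gives a rough idea (the real HashMap also spreads
    the hash bits before indexing, so treat the numbers as indicative only):

    import java.util.Collection;

    final class CollisionCheck {
        // Bucket the keys the way a power-of-two hash table would and report chain lengths.
        static void report(Collection<String> words, int tableSize) {   // tableSize: a power of two
            int[] buckets = new int[tableSize];
            for (String w : words) {
                buckets[w.hashCode() & (tableSize - 1)]++;
            }
            int used = 0, longest = 0;
            for (int c : buckets) {
                if (c > 0) used++;
                if (c > longest) longest = c;
            }
            System.out.println(words.size() + " words, " + used
                    + " buckets used, longest chain " + longest);
        }
    }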

    > Another way of looking at the problem is it would be nice to have a
    > HashSet implementation that was considerably faster than a HashMap.
    > IIRC, currently HashSet is implemented as a HashMap.


    Not carrying data will not make it considerably faster.

    Hashing and lookup are the same.

    Arne
    Arne Vajhøj, Nov 24, 2012
    #11
  12. markspace

    markspace Guest

    On 11/23/2012 10:42 PM, Roedy Green wrote:

    > My question had two purposes. To see if there was something available
    > off the shelf, and to stimulate thought on some new algorithm that
    > could have wider application that just my problem.



    I think it's unlikely that you're going to get a better algorithm than
    Boyer-Moore or some other standard algorithm. Those are standard
    algorithms for a reason: no one's come up with anything better.

    You might try compiling a regex to do it. Regex can be pretty efficient
    in many cases. Nothing can beat a hand-coded parser though, so if you
    really need speed, that's the ticket.

    Example regex:

    "\\b(word1|word2|word3|...etc.)\\b"

    You also might consider the case where you do have several matches in
    your character buffer. If you use a string and standard replacement,
    you'll copy the entire buffer for each match. If the buffer is large,
    that's a lot of time spent just copying the same bytes around. Look
    into making your own buffer/string that doesn't require making copies
    for each replacement. However if you really do seldom have a match,
    this might not be worth your time investment.
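
    One way to avoid the per-match copying with the standard classes is
    Matcher's append methods, which build the output only once (class and
    method names here are made up):

    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    final class OnePassReplace {
        // One pass over the text: matches are swapped for their replacement,
        // everything else is appended exactly once -- no per-match buffer copy.
        static String run(Pattern wordPattern, Map<String, String> replacements, CharSequence text) {
            Matcher m = wordPattern.matcher(text);
            StringBuffer out = new StringBuffer(text.length() + 64);
            while (m.find()) {
                String r = replacements.get(m.group());
                if (r == null) r = m.group();                     // matched but not in the table: keep as-is
                m.appendReplacement(out, Matcher.quoteReplacement(r));
            }
            m.appendTail(out);
            return out.toString();
        }
    }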
    markspace, Nov 24, 2012
    #12
  13. On 11/24/2012 08:39 AM, Knute Johnson wrote:
    > On 11/24/2012 3:34 AM, Roedy Green wrote:
    >> On Fri, 23 Nov 2012 22:42:40 -0800, Roedy Green
    >> <> wrote, quoted or indirectly quoted
    >> someone who said :
    >>
    >>> Another way of looking at the problem is it would be nice to have a
    >>> HashSet implementation that was considerably faster than a HashMap.
    >>> IIRC, currently HashSet is implemented as a HashMap.

    >>
    >> A a 3- valued HashSet.contains would still be useful
    >> no -- not in set
    >> yes - in set
    >> maybe -- don't know
    >>
    >> At least you could eliminate the no's from further processing.
    >>

    >
    > What about a sorted map? I would think searching a sorted map would be
    > much quicker than an unsorted one.
    >


    Actually it's not faster, I tried it.

    knute...
    Knute Johnson, Nov 24, 2012
    #13
  14. On 25.11.2012 14:50, Chris Uppal wrote:
    > Robert Klemme wrote:


    > Alternatively, use a real DFA matcher which will do all of that work for you as
    > it creates a minimal (or maybe only nearly minimal) DFA.


    Right, that's probably an even better option. :)

    >> Not sure though whether it is dramatically faster or slower than a standard
    >> string search like Boyer-Moore - probably not.

    >
    > As I mentioned elsewhere BM is no longer considered the best (there's a nice,
    > albeit somewhat specialised, book by Navarro and Raffinot "Flexible Pattern
    > Matching in Strings").


    Thank you for that pointer, Chris!

    > But my gut feeling (another way of saying that I
    > haven't bothered to test it) is that IO would probably dominate the execution
    > time whatever algorithm was used (assuming it wasn't implemented
    > inefficiently).


    Yes, that seems likely.

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
    http://blog.rubybestpractices.com/
    Robert Klemme, Nov 25, 2012
    #14
  15. Silvio

    Silvio Guest

    On 11/24/2012 02:12 AM, Roedy Green wrote:
    > Is there something like HashMap but that optimised when nearly always
    > the thing you are looking up is not in the list, and when you can add
    > the list of words to look up and then freeze it.
    >
    > I have to scan an entire website looking up every word.
    >


    The use case you describe sounds like a good fit for a character trie.
    The root represents the empty input string and each node has a child
    node for each character extension of its current input path that needs
    to be mapped. A node can optionally have a replacement string if it
    marks the end of a word-to-be-mapped. If it doesn't then it marks a
    sub-word only.

    You simply read word characters and step down into the trie. If you walk
    off the tree halfway through a word or if you end on a node that has no
    replacement value then the word does not exist and you need to copy it
    to your output. If the last character of a word leads to a node with a
    replacement string then that is what you output.

    Setting up the trie might take more time than setting up a plain word
    map but the processing should be a lot faster.
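
    A compact, untested sketch of that walk; the children live in a
    per-node HashMap just to keep the code short (an array per node, as
    discussed earlier in the thread, would be leaner):

    import java.util.HashMap;
    import java.util.Map;

    final class ReplacementTrie {
        private static final class Node {
            String replacement;                      // non-null only where a mapped word ends
            final Map<Character, Node> children = new HashMap<Character, Node>();
        }

        private final Node root = new Node();

        void put(String word, String replacement) {
            Node n = root;
            for (int i = 0; i < word.length(); i++) {
                Character c = Character.valueOf(word.charAt(i));
                Node child = n.children.get(c);
                if (child == null) { child = new Node(); n.children.put(c, child); }
                n = child;
            }
            n.replacement = replacement;
        }

        // Returns the replacement, or the word itself if we walk off the trie
        // partway through or end on a node without a replacement.
        String map(String word) {
            Node n = root;
            for (int i = 0; i < word.length(); i++) {
                n = n.children.get(Character.valueOf(word.charAt(i)));
                if (n == null) return word;          // the common miss, usually after a character or two
            }
            return n.replacement != null ? n.replacement : word;
        }
    }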

    Silvio
    Silvio, Nov 26, 2012
    #15
  16. Jim Janney

    Jim Janney Guest

    Roedy Green <> writes:

    > Is there something like HashMap but that optimised when nearly always
    > the thing you are looking up is not in the list, and when you can add
    > the list of words to look up and then freeze it.
    >
    > I have to scan an entire website looking up every word.


    In this case, where you are always looking up a newly created string,
    it's likely that computing the hash code takes more time than the actual
    lookup: java.lang.String.hashCode() looks at every character in the
    string. You could experiment with a weaker hash code that is cheaper to
    compute but generates more collisions, for example only hash the first n
    characters for some small value of n.

    Or take the length of the string (very cheap to compute) and check it
    against a BitSet. Or do the same with the first (or last) character.
    How well these strategies work will depend very much on the actual data.
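
    For instance, a screen on length and first character might look
    something like this (untested):

    import java.util.BitSet;
    import java.util.Set;

    final class CheapScreen {
        private final BitSet lengths = new BitSet();
        private final BitSet firstChars = new BitSet(Character.MAX_VALUE + 1);

        CheapScreen(Set<String> words) {
            for (String w : words) {
                lengths.set(w.length());
                if (w.length() > 0) firstChars.set(w.charAt(0));
            }
        }

        // Checks only the length and one character; a "false" here skips the map lookup entirely.
        boolean possiblyInTable(String w) {
            return w.length() > 0
                && lengths.get(w.length())
                && firstChars.get(w.charAt(0));
        }
    }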

    --
    Jim Janney
    Jim Janney, Nov 26, 2012
    #16
  17. On 24/11/2012 07:42, Roedy Green allegedly wrote:
    > On Fri, 23 Nov 2012 17:33:43 -0800, markspace <-@.> wrote, quoted or
    > indirectly quoted someone who said :
    >
    >> I'm not sure what you are trying to say there. You want the case where
    >> you do not find something in a hash map to be optimized? "Optimized" how?

    >
    >> What do you mean "add to the list of words" and "freeze"?

    >
    > The following is not the real problem, but it might more simply
    > illustrate what I am asking.
    >
    > Think of an ordinary HashMap<String,String>
    >
    > What it does is translate a few English words with French derivation,
    > putting the French accents on them. e.g. naive -> na&iuml;ve Napoleon
    > -> Napol&acute;on
    >
    > Let us say you have 100 such words you want to transform. (In my
    > actual problem I have about 1500 words).
    >
    > You go through the files for a website looking at each word of text
    > (avoiding HTML markup) in the HashMap. If you find it you replace it.
    >
    > Most of the time word you look up is not in the list.
    >
    > This is a time-consuming process. I would like to speed it up.


    You might want to intern() the input to avoid having to recompute the
    hash every time (if applicable). Other than that, you'll either be
    wanting a better hashing algorithm, to avoid collisions, or indeed
    something altogether fancier (but riskier in terms of RoI).

    *shrug*

    --
    DF.
    Daniele Futtorovic, Nov 26, 2012
    #17
  18. On 11/26/2012 10:03 PM, Daniele Futtorovic wrote:
    > On 24/11/2012 07:42, Roedy Green allegedly wrote:


    >> You go through the files for a website looking at each word of text
    >> (avoiding HTML markup) in the HashMap. If you find it you replace it.
    >>
    >> Most of the time word you look up is not in the list.
    >>
    >> This is a time-consuming process. I would like to speed it up.

    >
    > You might want to intern() the input to avoid having to recompute the
    > hash every time (if applicable). Other than that, you'll either be
    > wanting a better hashing algorithm, to avoid collisions, or indeed
    > something altogether fancier (but riskier in terms or RoI).


    How would interning help? The input is read only once anyway and if you
    mean to intern individual words of the input then how does the JVM do
    the interning? My guess would be that some form of hashing would be
    used there as well - plus that internal data structure must be thread
    safe...

    Kind regards

    robert
    Robert Klemme, Nov 26, 2012
    #18
  19. Daniel Pitts

    Daniel Pitts Guest

    On 11/23/12 5:12 PM, Roedy Green wrote:
    > Is there something like HashMap but that optimised when nearly always
    > the thing you are looking up is not in the list, and when you can add
    > the list of words to look up and then freeze it.
    >
    > I have to scan an entire website looking up every word.
    >


    I don't think your use-case needs this kind of (read "micro")
    optimization. However, I have often wished for a way to get a Map.Entry
    for a key, which could then be used to insert a new value efficiently.

    Map.Entry<K,V> getOrAddEntry(K key);


    Map.Entry<K,V> e = map.getOrAddEntry(myKey);

    if (e.getValue() == null) {
        e.setValue(newValue(myKey));
    }

    useValue(e.getValue());

    Alas, not exactly possible to add it now.
    Daniel Pitts, Nov 26, 2012
    #19
  20. Eric Sosman

    Eric Sosman Guest

    On 11/26/2012 6:44 PM, Daniel Pitts wrote:
    > On 11/23/12 5:12 PM, Roedy Green wrote:
    >> Is there something like HashMap but that optimised when nearly always
    >> the thing you are looking up is not in the list, and when you can add
    >> the list of words to look up and then freeze it.
    >>
    >> I have to scan an entire website looking up every word.
    >>

    >
    > I don't think your use-case needs this kind of (read "micro")
    > optimization. However, I have often wished for a way to get a Map.Entry
    > for a key, which could then be used to insert a new value efficiently.
    >
    > Map.Entry<K,V> getOrAddEntry(K key);
    >
    >
    > Map.Entry<K,V> e = map.getOrAddEntry(myKey);
    >
    > if (e.getValue() == null) {
    > e.setValue(newValue(myKey));
    > }
    >
    > useValue(e.getValue());
    >
    > Alas, not exactly possible to add it now.


    java.util.concurrent.ConcurrentMap has putIfAbsent().
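
    For example (a quick illustration, not tailored to Roedy's problem):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    final class PutIfAbsentDemo {
        public static void main(String[] args) {
            ConcurrentMap<String, String> map = new ConcurrentHashMap<String, String>();
            // Atomically insert if missing; returns the previous value, or null if our value went in.
            String previous = map.putIfAbsent("naive", "na&iuml;ve");
            String value = (previous != null) ? previous : "na&iuml;ve";
            System.out.println(value);
        }
    }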

    --
    Eric Sosman
    Eric Sosman, Nov 27, 2012
    #20