Summating Strings

Discussion in 'Java' started by Mike, Dec 22, 2006.

  1. Mike

    Mike Guest

    Please bear with me. I'm a bit new to Java. I have a list of strings
    that I'd like to summate. For example I'd like to take this data:

    Apple
    Orange
    Apple
    Apple
    Pear
    Peach
    Banana
    Banana

    And generate this data:

    Term Count
    ---- -----
    Apple 3
    Orange 1
    Pear 1
    Peach 1
    Banana 2

    I'd then like to sort this data so the items with the higher counts come
    first in the list. The list of source strings can be anywhere from 1 -
    20000 strings having anywhere from 1 - 500 unique terms. The average
    set will be roughly 300 strings with about 30 unique terms. I am
    currently doing this now by looping through the strings and keeping the
    term/count data in an ArrayList. I use a binary search to determine if
    the string is already in my list. If not I insert the string into the
    ArrayList in the appropriate location to keep the list sorted so the
    binary search will continue to work.

    My method is fast, but I'm hoping it can be faster. I'm running this
    code in a web app and it needs to be as fast as possible. The data
    constantly changes so caching is not an option. It's also not coming
    from a SQL data source so performing a query to generate this is also
    not an option.

    What I'm wondering is if there's some innate Java class that does this
    sort of thing already and hopefully does it much faster than I'm doing
    it. The code needs to work with Java Version 1.4.2_09. I've looked at
    TreeMap and some of the other various collection classes, but I'm not
    sure from the API docs which one might be best suited for my scenario.
    I'm hoping the collective wisdom of this group can help save me some
    time and aggravation.

    Thanks in advance!

    - Mike
    Mike, Dec 22, 2006
    #1
    1. Advertising

  2. Mike wrote:
    > Please bear with me. I'm a bit new to Java. I have a list of strings
    > that I'd like to summate. For example I'd like to take this data:
    >
    > Apple
    > Orange
    > Apple
    > Apple
    > Pear
    > Peach
    > Banana
    > Banana
    >
    > And generate this data:
    >
    > Term Count
    > ---- -----
    > Apple 3
    > Orange 1
    > Pear 1
    > Peach 1
    > Banana 2
    >
    > I'd then like to sort this data so the items with the higher counts come
    > first in the list. The list of source strings can be anywhere from 1 -
    > 20000 strings having anywhere from 1 - 500 unique terms. The average
    > set will be roughly 300 strings with about 30 unique terms. I am
    > currently doing this now by looping through the strings and keeping the
    > term/count data in an ArrayList. I use a binary search to determine if
    > the string is already in my list. If not I insert the string into the
    > ArrayList in the appropriate location to keep the list sorted so the
    > binary search will continue to work.


    Here is a little class I wrote for a similar problem. Once you have the
    set of keys, you can sort it using a Comparator that orders based on the
    associated count. You will need to modify it to remove the generics to
    use with 1.4. I'd be interested in hearing whether it is faster or
    slower than your current solution.

    /*
    * Created on Aug 14, 2004
    */
    package utilities;

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    /**
    * Utility class for counting instances of equal objects.
    */
    public class Counter {

    private Map<Object,Count> data = new HashMap<Object,Count>();

    /**
    * Increment by one the count associated with a specified
    * key.
    * @param key
    */
    public void increment(Object key) {
    Count count = data.get(key);
    if (count == null) {
    count = new Count();
    data.put(key, count);
    }
    count.increment();
    }

    /**
    * Get the count associated with a specified key.
    * @param key The key whose count is required.
    * @return The number of times increment has been called with
    * a key equal to this one.
    */
    public int get(Object key) {
    Count count = data.get(key);
    if (count == null) {
    return 0;
    } else {
    return count.get();
    }
    }

    /**
    * Get number of unique counted objects
    *
    * @return Number of key objects for which increment
    * has been called. Equal objects are only counted once.
    */
    public int size() {
    return data.size();
    }

    /**
    * Get all the unique counted objects
    * @return A set containing a refence to the counted objects.
    */
    public Set getKeys() {
    return Collections.unmodifiableSet(data.keySet());
    }

    private static class Count {
    private int val = 0;

    private void increment() {
    val++;
    }

    private int get() {
    return val;
    }
    }

    }
    Patricia Shanahan, Dec 22, 2006
    #2
    1. Advertising

  3. Mike

    Stefan Ram Guest

    Mike <> writes:
    >Term Count
    >---- -----
    >Apple 3
    >Orange 1


    So you want

    sort | uniq -c

    You might use something like

    class NumericMapUtils
    { public static <D> void addTo /* autovivificate the value to 0 */
    ( final java.util.Map<D,java.lang.Integer> map, final D d, final int i )
    { map.put( d, i +( map.containsKey( d )? map.get( d ): 0 )); }}

    and a sorted map

    java.util.TreeMap<java.lang.String,java.lang.Integer> map;

    then add each text like

    NumericMapUtils.addTo<java.lang.String>( map, "Apple", 1 );
    NumericMapUtils.addTo<java.lang.String>( map, "Orange", 1 );
    NumericMapUtils.addTo<java.lang.String>( map, "Apple", 1 );
    NumericMapUtils.addTo<java.lang.String>( map, "Apple", 1 );

    Then iterate: »for( final java.lang.String key: map.keySet() )«
    Stefan Ram, Dec 22, 2006
    #3
  4. Mike

    Mike Guest

    Thanks for the replies guys! I'll give them a shot and see how they
    perform.

    - Mike
    Mike, Dec 22, 2006
    #4
  5. Mike

    Hemal Pandya Guest

    Now I am confused.

    Mike wrote:
    [....]
    > I'd then like to sort this data so the items with the higher counts come
    > first in the list.


    Neither of the suggested solutions would meet this condition. Or would
    they? What am I missing?

    You need a collection of (word, frequency). When building the
    collection, you want it searchable by the word. When reporting, you
    need it sorted by frequency. One way is to sort the collection after
    the scan. Would that be worse then maintaining two collections during
    scan? I don't know...

    If I remember right, the K&R C Programming Language book has a sample
    code for word frequency.
    Hemal Pandya, Dec 22, 2006
    #5
  6. Mike

    Mike Guest

    You are correct, they don't. However, I can still see if they're any
    faster at building the list. Currently I am sorting as a separate step so
    I can still compare the speed of the various methods. I included the part
    about sorting in the original post because I thought it might influence the
    method used.

    - Mike

    > Neither of the suggested solutions would meet this condition. Or would
    > they? What am I missing?
    >
    Mike, Dec 22, 2006
    #6
  7. Hemal Pandya wrote:
    > Now I am confused.
    >
    > Mike wrote:
    > [....]
    >> I'd then like to sort this data so the items with the higher counts come
    >> first in the list.

    >
    > Neither of the suggested solutions would meet this condition. Or would
    > they? What am I missing?


    Maybe you missed my comment "Once you have the set of keys, you can sort
    it using a Comparator that orders based on the associated count."?

    >
    > You need a collection of (word, frequency). When building the
    > collection, you want it searchable by the word. When reporting, you
    > need it sorted by frequency. One way is to sort the collection after
    > the scan. Would that be worse then maintaining two collections during
    > scan? I don't know...


    A continuously sorted collection would be quite expensive to maintain,
    especially considering the statistics stated in the base message: "The
    list of source strings can be anywhere from 1 - 20000 strings having
    anywhere from 1 - 500 unique terms. The average set will be roughly 300
    strings with about 30 unique terms."

    Maintaining sorted order requires adjustment of the structure for every
    input, 300 times in the average case. A HashMap indexed by string has
    one insert for each unique string, 30 times in the average case,
    followed by a 30 element sort at the end. The remaining 270 operations
    do a HashMap get followed by an add.

    The OP did say "I'd then like to sort this data so the items with the
    higher counts come first in the list." which suggests to me that
    solutions that first collect the data and then sort meet the requirements.

    >
    > If I remember right, the K&R C Programming Language book has a sample
    > code for word frequency.
    >


    I don't think K&R gave enough attention to the java.util package to be a
    really good guide to Java programming in this area. :)

    Patricia
    Patricia Shanahan, Dec 22, 2006
    #7
  8. Mike

    Hemal Pandya Guest

    Patricia Shanahan wrote:
    [...]
    >
    > Maybe you missed my comment "Once you have the set of keys, you can sort
    > it using a Comparator that orders based on the associated count."?


    Of course. I didn't think it likely that you would not have addressed
    it. Perhaps I was thrown off because reading the above to mean to sort
    the keys and I was stuck with the through that the collection of pairs
    needs to be sorted.

    Thank you for the explanation of performance issue.
    Hemal Pandya, Dec 23, 2006
    #8
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Kurt Krueckeberg
    Replies:
    2
    Views:
    710
    =?ISO-8859-1?Q?Ney_Andr=E9_de_Mello_Zunino?=
    Nov 17, 2004
  2. Rick

    Comparing strings from within strings

    Rick, Oct 21, 2003, in forum: C Programming
    Replies:
    3
    Views:
    380
    Irrwahn Grausewitz
    Oct 21, 2003
  3. Klaus Neuner
    Replies:
    7
    Views:
    488
    Klaus Neuner
    Jul 26, 2004
  4. Girish Sahani
    Replies:
    17
    Views:
    573
    Boris Borcic
    Jun 9, 2006
  5. Ben

    Strings, Strings and Damned Strings

    Ben, Jun 22, 2006, in forum: C Programming
    Replies:
    14
    Views:
    757
    Malcolm
    Jun 24, 2006
Loading...

Share This Page