Summating Strings

Mike · Dec 22, 2006

Please bear with me. I'm a bit new to Java. I have a list of strings
that I'd like to summate. For example I'd like to take this data:

Apple
Orange
Apple
Apple
Pear
Peach
Banana
Banana

And generate this data:

Term Count
---- -----
Apple 3
Orange 1
Pear 1
Peach 1
Banana 2

I'd then like to sort this data so the items with the higher counts come
first in the list. The list of source strings can be anywhere from 1 -
20000 strings having anywhere from 1 - 500 unique terms. The average
set will be roughly 300 strings with about 30 unique terms. I am
currently doing this now by looping through the strings and keeping the
term/count data in an ArrayList. I use a binary search to determine if
the string is already in my list. If not I insert the string into the
ArrayList in the appropriate location to keep the list sorted so the
binary search will continue to work.

My method is fast, but I'm hoping it can be faster. I'm running this
code in a web app and it needs to be as fast as possible. The data
constantly changes so caching is not an option. It's also not coming
from a SQL data source so performing a query to generate this is also
not an option.

What I'm wondering is if there's some innate Java class that does this
sort of thing already and hopefully does it much faster than I'm doing
it. The code needs to work with Java Version 1.4.2_09. I've looked at
TreeMap and some of the other various collection classes, but I'm not
sure from the API docs which one might be best suited for my scenario.
I'm hoping the collective wisdom of this group can help save me some
time and aggravation.

Thanks in advance!

- Mike

Patricia Shanahan · Dec 22, 2006

Mike said:
Please bear with me. I'm a bit new to Java. I have a list of strings
that I'd like to summate. For example I'd like to take this data:

Apple
Orange
Apple
Apple
Pear
Peach
Banana
Banana

And generate this data:

Term Count
---- -----
Apple 3
Orange 1
Pear 1
Peach 1
Banana 2

I'd then like to sort this data so the items with the higher counts come
first in the list. The list of source strings can be anywhere from 1 -
20000 strings having anywhere from 1 - 500 unique terms. The average
set will be roughly 300 strings with about 30 unique terms. I am
currently doing this now by looping through the strings and keeping the
term/count data in an ArrayList. I use a binary search to determine if
the string is already in my list. If not I insert the string into the
ArrayList in the appropriate location to keep the list sorted so the
binary search will continue to work.

Here is a little class I wrote for a similar problem. Once you have the
set of keys, you can sort it using a Comparator that orders based on the
associated count. You will need to modify it to remove the generics to
use with 1.4. I'd be interested in hearing whether it is faster or
slower than your current solution.

/*
* Created on Aug 14, 2004
*/
package utilities;

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
* Utility class for counting instances of equal objects.
*/
public class Counter {

private Map<Object,Count> data = new HashMap<Object,Count>();

/**
* Increment by one the count associated with a specified
* key.
* @param key
*/
public void increment(Object key) {
Count count = data.get(key);
if (count == null) {
count = new Count();
data.put(key, count);
}
count.increment();
}

/**
* Get the count associated with a specified key.
* @param key The key whose count is required.
* @return The number of times increment has been called with
* a key equal to this one.
*/
public int get(Object key) {
Count count = data.get(key);
if (count == null) {
return 0;
} else {
return count.get();
}
}

/**
* Get number of unique counted objects
*
* @return Number of key objects for which increment
* has been called. Equal objects are only counted once.
*/
public int size() {
return data.size();
}

/**
* Get all the unique counted objects
* @return A set containing a refence to the counted objects.
*/
public Set getKeys() {
return Collections.unmodifiableSet(data.keySet());
}

private static class Count {
private int val = 0;

private void increment() {
val++;
}

private int get() {
return val;
}
}

}

Stefan Ram · Dec 22, 2006

Mike said:
Term Count
---- -----
Apple 3
Orange 1

So you want

sort | uniq -c

You might use something like

class NumericMapUtils
{ public static <D> void addTo /* autovivificate the value to 0 */
( final java.util.Map<D,java.lang.Integer> map, final D d, final int i )
{ map.put( d, i +( map.containsKey( d )? map.get( d ): 0 )); }}

and a sorted map

java.util.TreeMap<java.lang.String,java.lang.Integer> map;

then add each text like

NumericMapUtils.addTo<java.lang.String>( map, "Apple", 1 );
NumericMapUtils.addTo<java.lang.String>( map, "Orange", 1 );
NumericMapUtils.addTo<java.lang.String>( map, "Apple", 1 );
NumericMapUtils.addTo<java.lang.String>( map, "Apple", 1 );

Then iterate: »for( final java.lang.String key: map.keySet() )«

Mike · Dec 22, 2006

Thanks for the replies guys! I'll give them a shot and see how they
perform.

- Mike

Hemal Pandya · Dec 22, 2006

Now I am confused.

Mike wrote:
[....]

I'd then like to sort this data so the items with the higher counts come
first in the list.

Neither of the suggested solutions would meet this condition. Or would
they? What am I missing?

You need a collection of (word, frequency). When building the
collection, you want it searchable by the word. When reporting, you
need it sorted by frequency. One way is to sort the collection after
the scan. Would that be worse then maintaining two collections during
scan? I don't know...

If I remember right, the K&R C Programming Language book has a sample
code for word frequency.

Mike · Dec 22, 2006

You are correct, they don't. However, I can still see if they're any
faster at building the list. Currently I am sorting as a separate step so
I can still compare the speed of the various methods. I included the part
about sorting in the original post because I thought it might influence the
method used.

- Mike

Patricia Shanahan · Dec 22, 2006

Hemal said:
Now I am confused.

Mike wrote:
[....]

I'd then like to sort this data so the items with the higher counts come
first in the list.

Click to expand...

Neither of the suggested solutions would meet this condition. Or would
they? What am I missing?

Maybe you missed my comment "Once you have the set of keys, you can sort
it using a Comparator that orders based on the associated count."?

You need a collection of (word, frequency). When building the
collection, you want it searchable by the word. When reporting, you
need it sorted by frequency. One way is to sort the collection after
the scan. Would that be worse then maintaining two collections during
scan? I don't know...

A continuously sorted collection would be quite expensive to maintain,
especially considering the statistics stated in the base message: "The
list of source strings can be anywhere from 1 - 20000 strings having
anywhere from 1 - 500 unique terms. The average set will be roughly 300
strings with about 30 unique terms."

Maintaining sorted order requires adjustment of the structure for every
input, 300 times in the average case. A HashMap indexed by string has
one insert for each unique string, 30 times in the average case,
followed by a 30 element sort at the end. The remaining 270 operations
do a HashMap get followed by an add.

The OP did say "I'd then like to sort this data so the items with the
higher counts come first in the list." which suggests to me that
solutions that first collect the data and then sort meet the requirements.

If I remember right, the K&R C Programming Language book has a sample
code for word frequency.

I don't think K&R gave enough attention to the java.util package to be a
really good guide to Java programming in this area.

Patricia

Hemal Pandya · Dec 23, 2006

Patricia Shanahan wrote:
[...]

Maybe you missed my comment "Once you have the set of keys, you can sort
it using a Comparator that orders based on the associated count."?

Of course. I didn't think it likely that you would not have addressed
it. Perhaps I was thrown off because reading the above to mean to sort
the keys and I was stuck with the through that the collection of pairs
needs to be sorted.

Thank you for the explanation of performance issue.

Find and count strings of text from multiple files	17	Dec 16, 2021
Need help again please	19	Feb 14, 2020
Two Dictionaries and a Sum!	5	May 18, 2013
Concatenating regular exprs in a 'grep' call	9	Jan 19, 2009
Tasks	1	Nov 29, 2022
GET NEIL DEGRASSES TYSON, I ripped a hole with this one...	0	Nov 10, 2022
Batch modifying text - content and context based	5	Jan 19, 2023
Non-constant constant strings	561	Jan 19, 2014

Summating Strings

Mike

Patricia Shanahan

Stefan Ram

Mike

Hemal Pandya

Mike

Patricia Shanahan

Hemal Pandya

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads