faster data structure to count word frequency?


Kevin

Hello there,

I am just wondering how we can do this in a faster way, if any.

Suppose we are required to count the frequency of each distinct word
in a large text document. The way I am currently doing it is like:
//////////////////
Hashtable ht = new Hashtable();
// For each word read from the text, in order:
if (ht.containsKey(oneWord))
{
    Integer i = (Integer) ht.get(oneWord);
    ht.put(oneWord, new Integer(i.intValue() + 1));
}
else
{
    ht.put(oneWord, new Integer(1));
}
//////////////////
Since we cannot apply i++ (i is an immutable Integer object), I am thinking
we have to create a new Integer each time. Is there any way to make this
process faster? Or any other data structure we can use for this purpose?

Thanks a lot and happy weekend!
 

HK

Kevin said:
Suppose we are required to count the frequency of each distinct word
in a large text document. The way I am currently doing it is like:
//////////////////
Hashtable ht = new Hashtable();
// For each word read from the text, in order:
if (ht.containsKey(oneWord))
{
    Integer i = (Integer) ht.get(oneWord);
    ht.put(oneWord, new Integer(i.intValue() + 1));
}
else
{
    ht.put(oneWord, new Integer(1));
}

Indeed, because Integer is immutable, you would have
to create a new one each time. This is why you should
create your own (mini) class like:

private static class Count { public int count = 1; }

And then:
Count c = (Count) ht.get(oneWord);
if (c == null) ht.put(oneWord, new Count());
else c.count += 1;

While it might seem cumbersome to define a whole
new class for this purpose, in all but trivial
homework assignments, this class will soon end
up having more fields ;-)

BTW: If you don't really need the feature that
Hashtable is synchronized, you may want to
use
Map ht = new HashMap();
instead.
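
Putting it together, a minimal sketch of the whole counting loop (assuming
a hypothetical readWord() helper that returns the next word, or null at the
end of the input):

Map ht = new HashMap();
String oneWord;
while ((oneWord = readWord()) != null)  // readWord() is assumed, not a JDK method
{
    Count c = (Count) ht.get(oneWord);
    if (c == null) ht.put(oneWord, new Count());
    else c.count += 1;
}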

Harald.
 

Kevin

Thanks a lot for your suggestion.
By the way, any idea what the differences between Hashtable and HashMap
are, in terms of memory usage (overhead) and speed, supposing we are
dealing with 1 million items in the hash?
 

Boudewijn Dijkstra

Kevin said:
Hello there,

I am just wondering how we can do this in a faster way, if any.

Suppose we are required to count the frequency of each distinct word
in a large text document. The way I am currently doing it is like:
//////////////////
Hashtable ht = new Hashtable();
// For each word read from the text, in order:
if (ht.containsKey(oneWord))
{
    Integer i = (Integer) ht.get(oneWord);
    ht.put(oneWord, new Integer(i.intValue() + 1));
}
else
{
    ht.put(oneWord, new Integer(1));
}

This takes 2 hash lookups for each word. One of those is redundant. Also,
it creates a lot of equal Integer objects. Object creation and discarding is
often expensive.

init:
final Integer ONE = new Integer(1);

loop:
Object elm = ht.get(oneWord);
if (elm != null)
    ht.put(oneWord, new Integer(((Integer) elm).intValue() + 1));
else
    ht.put(oneWord, ONE);

You could take this even further by having a table of Integer objects, maybe
proportional to the size of the text. And of course a class to manage that.
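
For instance, a sketch of that cached-table idea (the class name IntegerCache
and the SIZE bound are made up here; SIZE is a guess at the largest common count):

class IntegerCache
{
    private static final int SIZE = 1024;  // assumed upper bound on common counts
    private static final Integer[] SMALL = new Integer[SIZE];
    static
    {
        for (int i = 0; i < SIZE; i++)
            SMALL[i] = new Integer(i);
    }

    static Integer get(int i)
    {
        // Reuse a cached object for small values; allocate only for large ones.
        return (i >= 0 && i < SIZE) ? SMALL[i] : new Integer(i);
    }
}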
 

el goog

Boudewijn said:
This takes 2 hash lookups for each word. One of those is redundant.

I agree here.

Boudewijn said:
Also, it creates a lot of equal Integer objects. Object creation and
discarding is often expensive.

You could take this even further by having a table of Integer objects,
maybe proportional to the size of the text. And of course a class to
manage that.

Before considering such micro-optimizations, you should determine whether
you really need it to be faster or use less memory, and if so, what is
your bottleneck (object creation or hash table ops).

If it's the hash table ops that are the bottleneck, you could store the
strings in an array, 3-way radix quicksort them (Reference: Sedgewick,
Algorithms in Java) and compute the frequency counts afterwards (or during
the sort for further speedup). I suspect this will outperform hashing and
use less memory. If you need to interleave insert and count operations,
you can use a ternary search trie (Reference: Sedgewick, Algorithms in
Java).

So, if you really need to optimize performance, there are plenty of things
you can do. But don't bother unless you really need to.
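
As a rough illustration of the sort-then-count idea (a sketch using the
JDK's Arrays.sort rather than a hand-rolled 3-way radix quicksort, and
assuming the words have already been collected in a String[] named words):

// java.util.Arrays.sort, then count runs of equal adjacent strings.
java.util.Arrays.sort(words);
int i = 0;
while (i < words.length)
{
    int j = i;
    while (j < words.length && words[j].equals(words[i]))
        j++;                                      // advance past the run
    System.out.println(words[i] + ": " + (j - i)); // run length = frequency
    i = j;
}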
 

el goog

I suspect this will outperform hashing and use less memory.

Clarification: the sorting approach may actually use more
memory if there are lots of duplicates since it stores each
string. On the other hand, the TST approach will use less
memory if there are lots of string prefixes in common.
 

Lee Fesperman

Boudewijn said:
Object creation and discarding is often expensive.

That is often not true in Java. Object allocation may consist of little more than
incrementing a pointer. Discarding an object has no (extra) cost at all. There may be
some cost in garbage collection (if it is even done), but modern JVMs are very good at
ameliorating that across all eligible objects. It's a win, win, win ;^)
 

Patricia Shanahan

Lee said:
That is often not true in Java. Object allocation may
consist of little more than incrementing a pointer.
Discarding an object has no (extra) cost at all. There
may be some cost in garbage collection (if it is even
done), but modern JVMs are very good at ameliorating that
across all eligible objects. It's a win, win, win ;^)

Agreed. However, I don't see any downside to using
Integer.valueOf(i) instead of new Integer(i).

Integer.valueOf tells the Integer class that you don't
require a new instance, as long as you get one with the
value you want. Caching in the Integer class is a much
better deal than caching in a particular use, because
several parts of the program can use the same Integer
object, without having to coordinate.
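
For example, the loop body might become (a minimal sketch):

Object elm = ht.get(oneWord);
int count = (elm == null) ? 1 : ((Integer) elm).intValue() + 1;
ht.put(oneWord, Integer.valueOf(count));  // may reuse a cached Integer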

Patricia
 

Lee Fesperman

Patricia said:
Agreed. However, I don't see any downside to using
Integer.valueOf(i) instead of new Integer(i).

Integer.valueOf tells the Integer class that you don't
require a new instance, as long as you get one with the
value you want. Caching in the Integer class is a much
better deal than caching in a particular use, because
several parts of the program can use the same Integer
object, without having to coordinate.

Agreed. valueOf() is almost always the best choice for the standard wrapper classes. I
love that method. It's not really premature optimization to choose the best method.
 

Chris Uppal

Kevin said:
Suppose we are required to count the frequency of each different word
in a large text documents,

Since you mention that you are dealing with large documents, I will assume that
you are dealing with 100s of megabytes (since otherwise there'd be little point
in the kinds of optimisation you are considering).

So, some thoughts. Note that none of these are backed by measurements; they
are just suggestions of things you might want /to/ measure.

First off. The disk read time for large (collections of) documents is
significant. Quite possibly more so than all the processing you will
subsequently do. If that is the case, then the only way to speed up the
processing is to reduce the amount of data you read (e.g. you could try keeping
the data on-disk in compressed form), or buy faster disks. BTW, when you come
to measure this, remember that the OS will cache small (10s of megabytes) files
pretty effectively, so it's hard to get a feel for what the real disk IO time
is; and impossible to measure /anything/ by looping reading the same small file
several times.

Second. In the old days, file IO was often dominated (after the disk IO times)
by the time it took to copy the data around in program-space. There were
several versions of 'C' <stdio> that provided real gains by avoiding that
overhead (I wrote one myself). I don't know to what extent such considerations
are still relevant on modern machines (they'll definitely be /less/ relevant).

Third. Your main loop (once you've got the data into memory) is to

create a string (two object creates plus a copy [*])
do a hash table lookup
do the addition
do a second, identical, hash table lookup

which leaves considerable scope for optimisation since most of that work is
wasted. You could create a special kind of hash table that could hash and find
a substring of an existing string (or a subsequence of a char array), thus
avoiding the expense of creating 10s of millions of String objects that would
(mostly) be immediately discarded (assuming that most strings would turn out to
be in the hash table already). You would ensure that the values in the hash
table were 'int's (or maybe 'long's), since if you are creating your own hash
table it would be a significant waste of time and space (/and/ add unnecessary
complexity) to mess around with Integer values. You would include in that hash
table implementation an operation like incrementCountAt(), so that you could
avoid the double lookup.
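
The substring-hashing part of that might start from something like this
sketch (it mirrors the polynomial hash that String.hashCode() uses):

static int hash(char[] buf, int off, int len)
{
    int h = 0;
    for (int i = 0; i < len; i++)
        h = 31 * h + buf[off + i];  // same recurrence as String.hashCode()
    return h;
}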

All of that would be moderately complicated to implement (although it would
also be moderately interesting), so before you do that you should:
create a test collection of representatively large documents
create a test program that counts the words using the simplest approach.
time it.
Then comment out the bit that actually counts the words (leaving just the IO),
and time the cut-down version.
How much difference is there? If it's small (as it might be), then you'd just
be wasting your time trying to optimise the word counting.
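
A minimal timing harness along those lines might look like this (a sketch;
the file to read is passed as a command-line argument, and the tokenization
is deliberately crude):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class WordCountTimer
{
    public static void main(String[] args) throws IOException
    {
        long t0 = System.currentTimeMillis();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        long words = 0;
        String line;
        while ((line = in.readLine()) != null)
        {
            // Comment out the next line to time the raw IO alone.
            words += line.split("\\s+").length;  // crude word count, for timing only
        }
        in.close();
        long t1 = System.currentTimeMillis();
        System.out.println(words + " words in " + (t1 - t0) + " ms");
    }
}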

-- chris

([*] though the buffer-sharing that Strings do needs to be taken into account
here too)
 

John C. Bollinger

Lee said:
That is often not true in Java. Object allocation may consist of little more than
incrementing a pointer. Discarding an object has no (extra) cost at all. There may be
some cost in garbage collection (if it is even done), but modern JVMs are very good at
ameliorating that across all eligible objects. It's a win, win, win ;^)

You're quite right, but if a simple integer operation can be substituted
for creation of a new object then that's yet better. Allocating a new
object will never, ever perform better than incrementing an int.
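
One cheap way to get exactly that inside a map-based counter (a sketch of
the well-known mutable-holder idiom, as an alternative to the Count class
suggested earlier) is a one-element int[]:

int[] c = (int[]) ht.get(oneWord);
if (c == null)
    ht.put(oneWord, new int[] { 1 });  // first occurrence
else
    c[0]++;                            // plain int increment, no new object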
 

Lee Fesperman

John said:
You're quite right, but if a simple integer operation can be substituted
for creation of a new object then that's yet better. Allocating a new
object will never, ever perform better than incrementing an int.

Quite right. I was mostly concerned that Boudewijn was applying non-Java concepts to
object creation/destruction, thus perpetuating myths. That's why I was somewhat forceful
in my counter.

Now, I do need to say that all is not totally rosy. There are some subtle aspects.
For instance, there is extra pressure on memory usage, affecting other processes and
perhaps forcing additional GC cycles. However, modern generational garbage collection
does a great job of reducing the cost of GC. The effects on other processes may not be
avoidable, so it's not good to be cavalier about this. I also used 'may' several
times in my original assertion, so there are smaller tradeoffs.
 

Boudewijn Dijkstra

Lee said:
Quite right. I was mostly concerned that Boudewijn was applying non-Java
concepts to object creation/destruction, thus perpetuating myths. That's why
I was somewhat forceful in my counter.

I will try to be aware of any future possible myth-perpetuating statements.
 

Richard Wheeldon

Kevin said:
Thanks a lot for your suggestion.
By the way, any idea what the differences between Hashtable and HashMap
are, in terms of memory usage (overhead) and speed, supposing we are
dealing with 1 million items in the hash?

Hashtable is synchronized. HashMap isn't. The differences are pretty
much the same as those between Vector and ArrayList. If you've only
got one thread, use HashMap.
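
And if you later do need a synchronized map without using Hashtable, the
JDK offers (a one-line sketch):

Map ht = java.util.Collections.synchronizedMap(new HashMap());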

Richard
 

Dale King

Kevin said:
Hello there,

I am just wondering how we can do this in a faster way, if any.

Suppose we are required to count the frequency of each distinct word
in a large text document. The way I am currently doing it is like:

The best data structure for this sort of thing is the trie. Java has no
standard implementation, as it is a very specialized structure.

You will have to roll your own or perhaps you can find a Java
implementation.

See <http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/> for an
explanation of tries.

In your case a PATRICIA trie would be the most likely candidate.
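
For illustration, a minimal (uncompressed) counting trie might look like the
sketch below; a PATRICIA trie would additionally collapse chains of
single-child nodes. The class name CountingTrie is made up here:

import java.util.HashMap;
import java.util.Map;

// A minimal counting trie: one node per character, one counter per word.
class CountingTrie
{
    private final Map children = new HashMap();  // Character -> CountingTrie
    private int count = 0;

    public void add(String word)
    {
        CountingTrie node = this;
        for (int i = 0; i < word.length(); i++)
        {
            Character c = new Character(word.charAt(i));
            CountingTrie child = (CountingTrie) node.children.get(c);
            if (child == null)
            {
                child = new CountingTrie();
                node.children.put(c, child);
            }
            node = child;
        }
        node.count++;  // one more occurrence of this word
    }

    public int count(String word)
    {
        CountingTrie node = this;
        for (int i = 0; i < word.length(); i++)
        {
            node = (CountingTrie) node.children.get(new Character(word.charAt(i)));
            if (node == null) return 0;  // word never seen
        }
        return node.count;
    }
}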
 
