email stop words


markspace

When indexing text files, there's a concept known as "stop words", which
are basically really common words that you don't normally want to index.

I just got done with a preliminary part of a project, where I indexed my
gmail inbox by parsing out all the white-space separated words from all
of my emails. For what it's worth, here's the 19 most common words in
my inbox, out of over 600 million characters, nearly 4 million words,
and probably almost 400,000 email messages.

So what I have here is a list of stop words for email. Here it is
without further ado, enjoy.

3888905 words
top words:
, 1730013
to, 582496
=A0, 552868
the, 544917
with, 476503
by, 451309
Received:, 398679
id, 380817
this, 324269
of, 296506
SMTP, 285885
for, 252344
from, 244664
-0700, 234140
2010, 231221
a, 224202
(PDT), 220162
and, 217103
BUILD SUCCESSFUL (total time: 1 minute 8 seconds)
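The whitespace-split counting described above can be sketched like this (a minimal illustration; the thread never shows the actual code, and `WordCount` is just a name chosen for the sketch):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    /** Count whitespace-separated tokens, the way the post describes. */
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                freq.merge(token, 1, Integer::sum);  // autoboxed counts
            }
        }
        return freq;
    }
}
```

The most frequent entries of the resulting map are exactly the stop-word candidates listed above.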
 

Arne Vajhøj

When indexing text files, there's a concept known as "stop words", which
are basically really common words that you don't normally want to index.

I just got done with a preliminary part of a project, where I indexed my
gmail inbox by parsing out all the white-space separated words from all
of my emails. For what it's worth, here's the 19 most common words in
my inbox, out of over 600 million characters, nearly 4 million words,
and probably almost 400,000 email messages.

So what I have here is a list of stop words for email. Here it is
without further ado, enjoy.

3888905 words
top words:
to, 582496
=A0, 552868
the, 544917
with, 476503
by, 451309
Received:, 398679
id, 380817
this, 324269
of, 296506
SMTP, 285885
for, 252344
from, 244664
-0700, 234140
2010, 231221
a, 224202
(PDT), 220162
and, 217103
BUILD SUCCESSFUL (total time: 1 minute 8 seconds)

I would have discarded special characters:
up front.

Arne
 

Lew

Arne said:
I would have discarded special characters:


up front.

You need a hyphen to spell words like "laissez-faire" and "higgledy-piggledy".

Apostrophe is a "special" character but very common in English words (most possessives, for example).

Plus-sign appears in, for example, "Google+" and "G+" and "+1".

So you need to be judicious in your definition of "special".
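Lew's point can be folded into the tokenizer's pattern itself. Something along these lines keeps internal hyphens and apostrophes and leading/trailing plus-signs (an illustrative sketch, not any poster's actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Tokenizer {
    // Alphanumerics, with internal '-', apostrophe, or '+' joining runs,
    // plus an optional leading or trailing '+'. Purely illustrative.
    private static final Pattern WORD =
            Pattern.compile("\\+?[A-Za-z0-9]+(?:['+-][A-Za-z0-9]+)*\\+?");

    public static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) out.add(m.group());
        return out;
    }
}
```

This keeps "laissez-faire", "don't", "Google+", and "+1" intact, which a bare "discard all special characters" pass would mangle.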
 

markspace

I would have discarded special characters:
up front.

It's not nearly that sophisticated yet. In time, it will find actual
words. Right now I'm just making sure I can read the whole mess in a
timely fashion. You should have seen how long it took before I switched to
ByteBuffers and regular expressions.
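One way to read a large mailbox quickly with ByteBuffers, as markspace alludes to, is to memory-map the file and decode it in one step (an illustrative sketch, not the poster's actual code; `FastReader` is an invented name):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FastReader {
    /** Map the file into memory and decode the whole thing at once. */
    public static String readAll(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buf =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // Latin-1 maps every byte to a char, so decoding can't fail.
            return StandardCharsets.ISO_8859_1.decode(buf).toString();
        }
    }
}
```

This avoids the per-character overhead of Reader-based loops, which is the kind of slowdown the post is describing.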
 

Arne Vajhøj

You need a hyphen to spell words like "laissez-faire" and "higgledy-piggledy".

I would be willing to search for the two words "laissez" and "faire" and
manually sort out those cases where they were not hyphened together.
Apostrophe is a "special" character but very common in English words (most possessives, for example).

I would again prefer to search for the base word.
Plus-sign appears in, for example, "Google+" and "G+" and "+1".

I can see that point.

Arne
 

Joshua Cranmer

When indexing text files, there's a concept known as "stop words", which
are basically really common words that you don't normally want to index.

I just got done with a preliminary part of a project, where I indexed my
gmail inbox by parsing out all the white-space separated words from all
of my emails. For what it's worth, here's the 19 most common words in
my inbox, out of over 600 million characters, nearly 4 million words,
and probably almost 400,000 email messages.

So what I have here is a list of stop words for email. Here it is
without further ado, enjoy.

3888905 words
top words:
to, 582496
=A0, 552868

The presence of =XX (where X is a hexadecimal character) is heavy
indication that you are failing to decode quoted-printable bodies
correctly, which makes the rest of your data suspect.
Received:, 398679

This is clearly an RFC 822 message header. You should probably be
performing some sort of normalization on message bodies to avoid parsing
this (you might be parsing, e.g., message/rfc822 as plain text, or a
message/delivery-notification as plain text).
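For reference, the =XX escapes come from quoted-printable encoding (RFC 2045); =A0 is a non-breaking space in Latin-1. The decoding Joshua is talking about can be sketched like this (a deliberately simplified illustration; real code should use a MIME library's decoder):

```java
public class QuotedPrintable {
    /** Decode =XX escapes and soft line breaks. Simplified sketch only. */
    public static String decode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '=' && i + 1 < s.length() && s.charAt(i + 1) == '\n') {
                i += 2;                        // soft line break: drop it
            } else if (c == '=' && i + 2 < s.length()
                    && s.charAt(i + 1) == '\r' && s.charAt(i + 2) == '\n') {
                i += 3;                        // soft line break, CRLF form
            } else if (c == '=' && i + 2 < s.length()) {
                int b = Integer.parseInt(s.substring(i + 1, i + 3), 16);
                out.append((char) b);          // e.g. =A0 -> U+00A0
                i += 3;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Without a pass like this, every "=A0" survives as a literal three-character token, which is exactly what the frequency list above shows.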
 

markspace

When indexing text files, there's a concept known as "stop words", which
are basically really common words that you don't normally want to index.

I just got done with a preliminary part of a project, where I indexed my
gmail inbox by parsing out all the white-space separated words from all
of my emails. For what it's worth, here's the 19 most common words in
my inbox, out of over 600 million characters, nearly 4 million words,
and probably almost 400,000 email messages.


OK, weird. I removed the auto-boxing from the HashMap I was using, and
it got MUCH slower. I'd estimate 30x slower, although I didn't let the
biggest test run to completion.

Any ideas?

Mine run to: I ended up making a lot more objects to avoid the immutable
Integer, and ended up using so much memory the garbage collector started
thrashing.

Or, Integer shares low values (<127) and doesn't create new objects.
There's so many of those low numbers that this ends up saving memory,
and objects, in the long run (my biggest test case, in other words).

Other thoughts?
 

Stefan Ram

markspace said:
So what I have here is a list of stop words for email. Here it is
without further ado, enjoy.

cat mbox | egrep -o '\w+' | sort | uniq -c | sort -nr | head -9

Now, let's do it for this newsgroup!

55974 the
40512 to
33308 a
28750 of
26663 news
25780 I
24471 and
24276 is
21312 for

(Some lines removed manually.)
 

Jukka Lahtinen

markspace said:
Mine run to: I ended up making a lot more objects to avoid the immutable
Integer, and ended up using so much memory the garbage collector started
thrashing.
Or, Integer shares low values (<127) and doesn't create new
objects. There's so many of those low numbers that this ends up saving
memory, and objects, in the long run (my biggest test case, in other

Do you use new Integer(x) where you should use Integer.parseInt(x)?
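The distinction matters because autoboxing goes through Integer.valueOf, which reuses a cache of boxed values for -128..127, while new Integer(...) always allocates a fresh object. A quick demonstration (illustrative class name):

```java
public class CacheDemo {
    /** Autoboxing uses Integer.valueOf, which caches -128..127. */
    public static boolean sameObject(int v) {
        Integer a = v;   // boxes via Integer.valueOf(v)
        Integer b = v;
        return a == b;   // reference comparison, true only for cached values
    }

    /** new Integer(...) allocates every time, even for small values. */
    public static boolean freshObjects(int v) {
        return new Integer(v) != new Integer(v);
    }
}
```

So a counting loop built on `new Integer(...)` creates one object per occurrence of every word, where autoboxing reuses one shared object per small count.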
 

Eric Sosman

OK, weird. I removed the auto-boxing from the HashMap I was using, and
it got MUCH slower. I'd estimate 30x slower, although I didn't let the
biggest test run to completion.

Any ideas?

Mine run to: I ended up making a lot more objects to avoid the immutable
Integer, and ended up using so much memory the garbage collector started
thrashing.

You didn't say exactly what your code was like, but I imagine
you were using a HashMap<Word,Integer> and processing each Word
with something along the lines of

Integer count = map.get(word);
map.put(word, count == null ? 1 : count + 1);

... and that you switched to something more like

Integer count = map.get(word);
map.put(word, new Integer(count == null
                          ? 1 : count.intValue() + 1));

If so, the slowdown is probably due to increased memory pressure
and garbage collection: `new' actually creates a new object every
time, while auto-boxing uses (the equivalent of) Integer.valueOf().
The latter maintains a pool of a couple hundred small-valued Integers
and doles them out whenever needed, using `new' only for un-pooled
values.

(It's probably not the frequent words that kill you, since
their counts will grow too large to be satisfied from the pool.
But the "long tail" may be murder: Using `new' you'll have many
millions of Integer objects representing small numbers, whereas
by using auto-boxing every "five" would be the same object.)

My suggestion would be to implement a Counter class that
wraps a mutable integer value. Then you'd use

HashMap<Word,Counter> map = ...;
Counter count = map.get(word);
if (count == null) {
    count = new Counter();
    map.put(word, count);
}
count.increment();

This way you'd create just one Counter per unique Word, instead
of one Integer for every occurrence of every word. To deal with
the long tail, make Counter extend Number and use

HashMap<Word,Number> map = ...;
Number count = map.get(word);
if (count == null) {
    map.put(word, 1);                      // autoboxed, comes from the pool
} else if (count instanceof Integer) {
    int c = count.intValue();
    if (c < THRESHOLD) {
        map.put(word, c + 1);              // still small enough to pool
    } else {
        map.put(word, new Counter(c + 1)); // graduate to a mutable Counter
    }
} else {
    ((Counter) count).increment();
}

This uses the pooled Integers as long as they last (they're created
anyhow, so they take no additional space), and then one Counter
object for each Word that appears more than THRESHOLD times.

Or, you could just go back to auto-boxing.
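Eric's Counter idea, fleshed out as a compilable sketch (String stands in for the thread's Word type, which is never shown):

```java
import java.util.HashMap;
import java.util.Map;

/** A mutable counter: one object per distinct word, mutated in place. */
public class Counter {
    private int count;

    public void increment() { count++; }
    public int value()      { return count; }

    /** Count words with one Counter per key, no boxing per occurrence. */
    public static Map<String, Counter> countWords(String[] words) {
        Map<String, Counter> map = new HashMap<>();
        for (String word : words) {
            Counter c = map.get(word);
            if (c == null) {
                c = new Counter();
                map.put(word, c);
            }
            c.increment();
        }
        return map;
    }
}
```

Total allocations are bounded by the number of distinct words, not the number of occurrences, which is the whole point of the suggestion.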
 

markspace

Integer count = map.get(word);
map.put(word, count == null ? 1 : count + 1);

Basically, yes.
... and that you switched to something more like

Integer count = map.get(word);
map.put(word, new Integer(count == null
                          ? 1 : count.intValue() + 1));

No, I made a Counter with a primitive and a reference to the word:

Counter counter = map.get( word );
if( counter == null ) {
    counter = new Counter();
    counter.word = word;
    counter.count = 1;
    map.put( word, counter );
} else
    counter.count++;
If so, the slowdown is probably due to increased memory pressure
and garbage collection: `new' actually creates a new object every

Yeah, that's what I thought too. Although since there's only as many
Counters as there are Strings (words), I don't get why just making a 2x
change would slow the system as horribly as it did. There should be
only 4 million Strings and therefore also 4 million Counters. I can't
figure out why that would be a problem.
time, while auto-boxing uses (the equivalent of) Integer.valueOf().
The latter maintains a pool of a couple hundred small-valued Integers
and doles them out whenever needed, using `new' only for un-pooled
values.

I think it would be worth it to change the JVM memory parameters from
the defaults and see if that makes a difference.

Also, any thoughts on the best way to observe a GC that is thrashing?
I'm really curious to pin this down to some sort of root cause. I
couldn't rule out a coding error somewhere either.
My suggestion would be to implement a Counter class that
wraps a mutable integer value. Then you'd use

Thanks, I'll take a look at this when I get a chance. A good suggestion!
Or, you could just go back to auto-boxing.

Yes, A-B-A testing works. Going back to auto-boxing restored the
previous run times, so I'm fairly certain it's related to memory
pressure or something similar.
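For what it's worth, the HotSpot JVMs of that era could make the collector's behavior visible with logging flags along these lines (a config sketch; `WordIndexer` stands in for the actual main class, which isn't named in the thread):

```shell
# One line per collection, with pause times and heap occupancy.
# (Java 6/7-era HotSpot flags; Java 9+ replaces these with -Xlog:gc.)
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -Xloggc:gc.log WordIndexer
```

If the log shows back-to-back full collections that each reclaim very little, the live set is too big for the heap, which is the thrashing pattern discussed above.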
 

Eric Sosman

Basically, yes.


No, I made a Counter with a primitive and a reference to the word:

Counter counter = map.get( word );
if( counter == null ) {
    counter = new Counter();
    counter.word = word;
    counter.count = 1;
    map.put( word, counter );
} else
    counter.count++;


Yeah, that's what I thought too. Although since there's only as many
Counters as there are Strings (words), I don't get why just making a 2x
change would slow the system as horribly as it did. There should be
only 4 million Strings and therefore also 4 million Counters. I can't
figure out why that would be a problem.

It might be the "long tail" I mentioned earlier. With the
second scheme you need four million Counter objects, while the
original used (perhaps) a hundred thousand large Integers plus
3.9 million references to the few small Integers in the static pool.

Back of the envelope: The Map holds four million references
to Map.Entry objects, each of which holds a key reference, a
value reference, and a link. With the Integer original, to this
you add a hundred thousand (same out-of-thin-air figure as before)
Integer instances. Total: 16 million references, 4.1 million objects.

The change to a "word-aware" Counter adds four million more
references and 3.9 million more objects. Yeah, I can see where
that might have a teeny-tiny impact ...
Also, any thoughts on the best way to observe a GC that is thrashing?
I'm really curious to pin this down to some sort of root cause. I
couldn't rule out a coding error somewhere either.

Hmmm: I used to know something about tuning GC, but my knowledge
is about a decade out of date -- in an area that's had a lot of R&D
in the meantime. There's some Java 6 stuff at

http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html

... but I haven't read it and can't assess it.
Thanks, I'll take a look at this when I get a chance. A good suggestion!

If I've understood you correctly, you've already done this --
and that's when the trouble started. Perhaps the hybrid Integer-
or-Counter approach would help, though.
 

Joshua Cranmer

OK, weird. I removed the auto-boxing from the HashMap I was using, and
it got MUCH slower. I'd estimate 30x slower, although I didn't let the
biggest test run to completion.

Any ideas?

The JVM is optimizing autoboxed integers? Looking briefly at the code
for OpenJDK, there are definitely optimizations for autoboxing that do
things like "if foo is from the Integer cache, turn foo.value into
address_of_foo - address_of_IntegerCache"
 

markspace

The JVM is optimizing autoboxed integers?

That occurred to me, but since I know nothing about how the JVM might do
that, I didn't speculate.
Looking briefly at the code
for OpenJDK, there are definitely optimizations for autoboxing that do
things like "if foo is from the Integer cache, turn foo.value into
address_of_foo - address_of_IntegerCache"

Interesting. I mostly just increment Integers by one. If the addresses
of pre-allocated Integers (<127) are all stored sequentially, then
that's equivalent to adding a constant offset to each address, about the
same overhead as just adding 1. Hmmm, that's 100% speculation, but a
cool optimization if they do it that way.
 
