email stop words


markspace

When indexing text files, there's a concept known as "stop words", which
are basically really common words that you don't normally want to index.

I just got done with a preliminary part of a project, where I indexed my
gmail inbox by parsing out all the white-space separated words from all
of my emails. For what it's worth, here's the 19 most common words in
my inbox, out of over 600 million characters, nearly 4 million words,
and probably almost 400,000 email messages.

So what I have here is a list of stop words for email. Here it is
without further ado, enjoy.

3888905 words
top words:
, 1730013
to, 582496
=A0, 552868
the, 544917
with, 476503
by, 451309
Received:, 398679
id, 380817
this, 324269
of, 296506
SMTP, 285885
for, 252344
from, 244664
-0700, 234140
2010, 231221
a, 224202
(PDT), 220162
and, 217103
BUILD SUCCESSFUL (total time: 1 minute 8 seconds)
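The whitespace-split counting described above can be sketched like this (a minimal illustration; the thread never shows the actual code, and `WordCount` is just a name chosen for the sketch):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCount {
    /** Count whitespace-separated tokens, the way the post describes. */
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                freq.merge(token, 1, Integer::sum);  // autoboxed counts
            }
        }
        return freq;
    }
}
```

The most frequent entries of the resulting map are exactly the stop-word candidates listed above.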
 

Arne Vajhøj

When indexing text files, there's a concept known as "stop words", which
are basically really common words that you don't normally want to index.

I just got done with a preliminary part of a project, where I indexed my
gmail inbox by parsing out all the white-space separated words from all
of my emails. For what it's worth, here's the 19 most common words in
my inbox, out of over 600 million characters, nearly 4 million words,
and probably almost 400,000 email messages.

So what I have here is a list of stop words for email. Here it is
without further ado, enjoy.

3888905 words
top words:
to, 582496
=A0, 552868
the, 544917
with, 476503
by, 451309
Received:, 398679
id, 380817
this, 324269
of, 296506
SMTP, 285885
for, 252344
from, 244664
-0700, 234140
2010, 231221
a, 224202
(PDT), 220162
and, 217103
BUILD SUCCESSFUL (total time: 1 minute 8 seconds)

I would have discarded special characters:
up front.

Arne
 

Lew

Arne said:
I would have discarded special characters:


up front.

You need a hyphen to spell words like "laissez-faire" and "higgledy-piggledy".

Apostrophe is a "special" character but very common in English words (most possessives, for example).

Plus-sign appears in, for example, "Google+" and "G+" and "+1".

So you need to be judicious in your definition of "special".
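Lew's point can be folded into the tokenizer's pattern itself. Something along these lines keeps internal hyphens and apostrophes and leading/trailing plus-signs (an illustrative sketch, not any poster's actual code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Tokenizer {
    // Alphanumerics, with internal '-', apostrophe, or '+' joining runs,
    // plus an optional leading or trailing '+'. Purely illustrative.
    private static final Pattern WORD =
            Pattern.compile("\\+?[A-Za-z0-9]+(?:['+-][A-Za-z0-9]+)*\\+?");

    public static List<String> words(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) out.add(m.group());
        return out;
    }
}
```

This keeps "laissez-faire", "don't", "Google+", and "+1" intact, which a bare "discard all special characters" pass would mangle.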
 

markspace

I would have discarded special characters:
up front.

It's not nearly that sophisticated yet. In time, it will find actual
words. Right now I'm just making sure I can read the whole mess in a
timely fashion. You should have seen how long it took before I switched to
ByteBuffers and regular expressions.
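One way to read a large mailbox quickly with ByteBuffers, as markspace alludes to, is to memory-map the file and decode it in one step (an illustrative sketch, not the poster's actual code; `FastReader` is an invented name):

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FastReader {
    /** Map the file into memory and decode the whole thing at once. */
    public static String readAll(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buf =
                    ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // Latin-1 maps every byte to a char, so decoding can't fail.
            return StandardCharsets.ISO_8859_1.decode(buf).toString();
        }
    }
}
```

This avoids the per-character overhead of Reader-based loops, which is the kind of slowdown the post is describing.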
 

Arne Vajhøj

You need a hyphen to spell words like "laissez-faire" and "higgledy-piggledy".

I would be willing to search for the two words "laissez" and "faire" and
manually sort out those cases where they were not hyphened together.
Apostrophe is a "special" character but very common in English words (most possessives, for example).

I would again prefer to search for the base word.
Plus-sign appears in, for example, "Google+" and "G+" and "+1".

I can see that point.

Arne
 

Joshua Cranmer

When indexing text files, there's a concept known as "stop words", which
are basically really common words that you don't normally want to index.

I just got done with a preliminary part of a project, where I indexed my
gmail inbox by parsing out all the white-space separated words from all
of my emails. For what it's worth, here's the 19 most common words in
my inbox, out of over 600 million characters, nearly 4 million words,
and probably almost 400,000 email messages.

So what I have here is a list of stop words for email. Here it is
without further ado, enjoy.

3888905 words
top words:
to, 582496
=A0, 552868

The presence of =XX (where X is a hexadecimal character) is heavy
indication that you are failing to decode quoted-printable bodies
correctly, which makes the rest of your data suspect.
Received:, 398679

This is clearly an RFC 822 message header. You should probably be
performing some sort of normalization on message bodies to avoid parsing
this (you might be parsing, e.g., message/rfc822 as plain text, or a
message/delivery-notification as plain text).
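For reference, the =XX escapes come from quoted-printable encoding (RFC 2045); =A0 is a non-breaking space in Latin-1. The decoding Joshua is talking about can be sketched like this (a deliberately simplified illustration; real code should use a MIME library's decoder):

```java
public class QuotedPrintable {
    /** Decode =XX escapes and soft line breaks. Simplified sketch only. */
    public static String decode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '=' && i + 1 < s.length() && s.charAt(i + 1) == '\n') {
                i += 2;                        // soft line break: drop it
            } else if (c == '=' && i + 2 < s.length()
                    && s.charAt(i + 1) == '\r' && s.charAt(i + 2) == '\n') {
                i += 3;                        // soft line break, CRLF form
            } else if (c == '=' && i + 2 < s.length()) {
                int b = Integer.parseInt(s.substring(i + 1, i + 3), 16);
                out.append((char) b);          // e.g. =A0 -> U+00A0
                i += 3;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }
}
```

Without a pass like this, every "=A0" survives as a literal three-character token, which is exactly what the frequency list above shows.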
 

markspace

When indexing text files, there's a concept known as "stop words", which
are basically really common words that you don't normally want to index.

I just got done with a preliminary part of a project, where I indexed my
gmail inbox by parsing out all the white-space separated words from all
of my emails. For what it's worth, here's the 19 most common words in
my inbox, out of over 600 million characters, nearly 4 million words,
and probably almost 400,000 email messages.


OK, weird. I removed the auto-boxing from the HashMap I was using, and
it got MUCH slower. I'd estimate 30x slower, although I didn't let the
biggest test run to completion.

Any ideas?

Mine run to: I ended up making a lot more objects to avoid the immutable
Integer, and ended up using so much memory the garbage collector started
thrashing.

Or, Integer shares low values (<127) and doesn't create new objects.
There's so many of those low numbers that this ends up saving memory,
and objects, in the long run (my biggest test case, in other words).

Other thoughts?
 

Stefan Ram

markspace said:
So what I have here is a list of stop words for email. Here it is
without further ado, enjoy.

cat mbox | egrep -o '\w+' | sort | uniq -c | sort -nr | head -9

Now, let's do it for this newsgroup!

55974 the
40512 to
33308 a
28750 of
26663 news
25780 I
24471 and
24276 is
21312 for

(Some lines removed manually.)
 

Jukka Lahtinen

markspace said:
Mine run to: I ended up making a lot more objects to avoid the immutable
Integer, and ended up using so much memory the garbage collector started
thrashing.
Or, Integer shares low values (<127) and doesn't create new
objects. There's so many of those low numbers that this ends up saving
memory, and objects, in the long run (my biggest test case, in other

Do you use new Integer(x) where you should use Integer.parseInt(x)?
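The distinction matters because autoboxing goes through Integer.valueOf, which reuses a cache of boxed values for -128..127, while new Integer(...) always allocates a fresh object. A quick demonstration (illustrative class name):

```java
public class CacheDemo {
    /** Autoboxing uses Integer.valueOf, which caches -128..127. */
    public static boolean sameObject(int v) {
        Integer a = v;   // boxes via Integer.valueOf(v)
        Integer b = v;
        return a == b;   // reference comparison, true only for cached values
    }

    /** new Integer(...) allocates every time, even for small values. */
    public static boolean freshObjects(int v) {
        return new Integer(v) != new Integer(v);
    }
}
```

So a counting loop built on `new Integer(...)` creates one object per occurrence of every word, where autoboxing reuses one shared object per small count.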
 

Eric Sosman

OK, weird. I removed the auto-boxing from the HashMap I was using, and
it got MUCH slower. I'd estimate 30x slower, although I didn't let the
biggest test run to completion.

Any ideas?

Mine run to: I ended up making a lot more objects to avoid the immutable
Integer, and ended up using so much memory the garbage collector started
thrashing.

You didn't say exactly what your code was like, but I imagine
you were using a HashMap<Word,Integer> and processing each Word
with something along the lines of

Integer count = map.get(word);
map.put(word, count == null ? 1 : count + 1);

... and that you switched to something more like

Integer count = map.get(word);
map.put(word, new Integer(count == null
                          ? 1 : count.intValue() + 1));

If so, the slowdown is probably due to increased memory pressure
and garbage collection: `new' actually creates a new object every
time, while auto-boxing uses (the equivalent of) Integer.valueOf().
The latter maintains a pool of a couple hundred small-valued Integers
and doles them out whenever needed, using `new' only for un-pooled
values.

(It's probably not the frequent words that kill you, since
their counts will grow too large to be satisfied from the pool.
But the "long tail" may be murder: Using `new' you'll have many
millions of Integer objects representing small numbers, whereas
by using auto-boxing every "five" would be the same object.)

My suggestion would be to implement a Counter class that
wraps a mutable integer value. Then you'd use

HashMap<Word,Counter> map = ...;
Counter count = map.get(word);
if (count == null) {
    count = new Counter();
    map.put(word, count);
}
count.increment();

This way you'd create just one Counter per unique Word, instead
of one Integer for every occurrence of every word. To deal with
the long tail, make Counter extend Number and use

HashMap<Word,Number> map = ...;
Number count = map.get(word);
if (count == null) {
    map.put(word, 1);                      // autoboxed, comes from the pool
} else if (count instanceof Integer) {
    int c = count.intValue();
    if (c < THRESHOLD) {
        map.put(word, c + 1);              // still small enough to pool
    } else {
        map.put(word, new Counter(c + 1)); // graduate to a mutable Counter
    }
} else {
    ((Counter) count).increment();
}

This uses the pooled Integers as long as they last (they're created
anyhow, so they take no additional space), and then one Counter
object for each Word that appears more than THRESHOLD times.

Or, you could just go back to auto-boxing.
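Eric's Counter idea, fleshed out as a compilable sketch (String stands in for the thread's Word type, which is never shown):

```java
import java.util.HashMap;
import java.util.Map;

/** A mutable counter: one object per distinct word, mutated in place. */
public class Counter {
    private int count;

    public void increment() { count++; }
    public int value()      { return count; }

    /** Count words with one Counter per key, no boxing per occurrence. */
    public static Map<String, Counter> countWords(String[] words) {
        Map<String, Counter> map = new HashMap<>();
        for (String word : words) {
            Counter c = map.get(word);
            if (c == null) {
                c = new Counter();
                map.put(word, c);
            }
            c.increment();
        }
        return map;
    }
}
```

Total allocations are bounded by the number of distinct words, not the number of occurrences, which is the whole point of the suggestion.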
 

markspace

Integer count = map.get(word);
map.put(word, count == null ? 1 : count + 1);

Basically, yes.
... and that you switched to something more like

Integer count = map.get(word);
map.put(word, new Integer(count == null
                          ? 1 : count.intValue() + 1));

No, I made a Counter with a primitive and a reference to the word:

Counter counter = map.get( word );
if( counter == null ) {
    counter = new Counter();
    counter.word = word;
    counter.count = 1;
    map.put( word, counter );
} else
    counter.count++;
If so, the slowdown is probably due to increased memory pressure
and garbage collection: `new' actually creates a new object every

Yeah, that's what I thought too. Although since there's only as many
Counters as there are Strings (words), I don't get why just making a 2x
change would slow the system as horribly as it did. There should be
only 4 million Strings and therefore also 4 million Counters. I can't
figure out why that would be a problem.
time, while auto-boxing uses (the equivalent of) Integer.valueOf().
The latter maintains a pool of a couple hundred small-valued Integers
and doles them out whenever needed, using `new' only for un-pooled
values.

I think it would be worth it to change the JVM memory parameters from
the defaults and see if that makes a difference.

Also, any thoughts on the best way to observe a GC that is thrashing?
I'm really curious to pin this down to some sort of root cause. I
couldn't rule out a coding error somewhere either.
My suggestion would be to implement a Counter class that
wraps a mutable integer value. Then you'd use

Thanks, I'll take a look at this when I get a chance. A good suggestion!
Or, you could just go back to auto-boxing.

Yes, A-B-A testing works. Going back to auto-boxing restored the
previous run times, so I'm fairly certain it's related to memory
pressure or something similar.
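For what it's worth, the HotSpot JVMs of that era could make the collector's behavior visible with logging flags along these lines (a config sketch; `WordIndexer` stands in for the actual main class, which isn't named in the thread):

```shell
# One line per collection, with pause times and heap occupancy.
# (Java 6/7-era HotSpot flags; Java 9+ replaces these with -Xlog:gc.)
java -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
     -Xloggc:gc.log WordIndexer
```

If the log shows back-to-back full collections that each reclaim very little, the live set is too big for the heap, which is the thrashing pattern discussed above.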
 

Eric Sosman

Basically, yes.


No, I made a Counter with a primitive and a reference to the word:

Counter counter = map.get( word );
if( counter == null ) {
    counter = new Counter();
    counter.word = word;
    counter.count = 1;
    map.put( word, counter );
} else
    counter.count++;


Yeah, that's what I thought too. Although since there's only as many
Counters as there are Strings (words), I don't get why just making a 2x
change would slow the system as horribly as it did. There should be
only 4 million Strings and therefore also 4 million Counters. I can't
figure out why that would be a problem.

It might be the "long tail" I mentioned earlier. With the
second scheme you need four million Counter objects, while the
original used (perhaps) a hundred thousand large Integers plus
3.9 million references to the few small Integers in the static pool.

Back of the envelope: The Map holds four million references
to Map.Entry objects, each of which holds a key reference, a
value reference, and a link. With the Integer original, to this
you add a hundred thousand (same out-of-thin-air figure as before)
Integer instances. Total: 16 million references, 4.1 million objects.

The change to a "word-aware" Counter adds four million more
references and 3.9 million more objects. Yeah, I can see where
that might have a teeny-tiny impact ...
Also, any thoughts on the best way to observe a GC that is thrashing?
I'm really curious to pin this down to some sort of root cause. I
couldn't rule out a coding error somewhere either.

Hmmm: I used to know something about tuning GC, but my knowledge
is about a decade out of date -- in an area that's had a lot of R&D
in the meantime. There's some Java 6 stuff at

http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html

... but I haven't read it and can't assess it.
Thanks, I'll take a look at this when I get a chance. A good suggestion!

If I've understood you correctly, you've already done this --
and that's when the trouble started. Perhaps the hybrid Integer-
or-Counter approach would help, though.
 

Joshua Cranmer

OK, weird. I removed the auto-boxing from the HashMap I was using, and
it got MUCH slower. I'd estimate 30x slower, although I didn't let the
biggest test run to completion.

Any ideas?

The JVM is optimizing autoboxed integers? Looking briefly at the code
for OpenJDK, there are definitely optimizations for autoboxing that do
things like "if foo is from the Integer cache, turn foo.value into
address_of_foo - address_of_IntegerCache"
 

markspace

The JVM is optimizing autoboxed integers?

That occurred to me, but since I know nothing about how the JVM might do
that, I didn't speculate.
Looking briefly at the code
for OpenJDK, there are definitely optimizations for autoboxing that do
things like "if foo is from the Integer cache, turn foo.value into
address_of_foo - address_of_IntegerCache"

Interesting. I mostly just increment Integers by one. If the addresses
of pre-allocated Integers (<127) are all stored sequentially, then
that's equivalent to adding a constant offset to each address, about the
same overhead as just adding 1. Hmmm, that's 100% speculation, but a
cool optimization if they do it that way.
 
