markspace
When indexing text files, there's a concept known as "stop words":
really common words that you normally don't want to index.
I just finished a preliminary part of a project where I indexed my
Gmail inbox by parsing out all the whitespace-separated words from all
of my emails. For what it's worth, here are the 19 most common words in
my inbox, out of over 600 million characters, nearly 4 million words,
and probably almost 400,000 email messages.
So what I have here is a list of stop words for email. Here it is
without further ado. Enjoy.
3888905 words
top words:
to, 582496
, 1730013
=A0, 552868
the, 544917
with, 476503
by, 451309
Received:, 398679
id, 380817
this, 324269
of, 296506
SMTP, 285885
for, 252344
from, 244664
-0700, 234140
2010, 231221
a, 224202
(PDT), 220162
and, 217103
BUILD SUCCESSFUL (total time: 1 minute 8 seconds)
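
For anyone who wants to try the same thing, here is a minimal sketch of
the counting step in Python. It is not the program that produced the
output above. The function names (top_words, index_terms) and the
sample messages are made up for illustration, and it assumes each email
is already available as a plain string.

```python
from collections import Counter


def top_words(messages, n=19):
    """Return the n most frequent whitespace-separated tokens.

    `messages` is any iterable of strings (hypothetical input format).
    """
    counts = Counter()
    for text in messages:
        # str.split() with no arguments splits on any run of whitespace,
        # so headers, dates, and encoding debris all count as "words".
        counts.update(text.split())
    return counts.most_common(n)


def index_terms(text, stop_words):
    """Return the tokens worth indexing, skipping anything in stop_words."""
    return [token for token in text.split() if token not in stop_words]


if __name__ == "__main__":
    # Made-up sample data, not taken from the inbox described above.
    sample = [
        "Received: by mail with SMTP id abc",
        "Received: from host by mail with SMTP id def",
    ]
    print("top words:")
    for word, count in top_words(sample, n=5):
        print(f"{word}, {count}")

    stop_words = {word for word, _ in top_words(sample, n=5)}
    print(index_terms("Received: from host with love", stop_words))
```

Because the split is purely on whitespace, tokens like "Received:" and
"-0700" end up in the counts, which is why mail headers show up in a
stop word list built this way.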