Script to find unique words in a document

Mike

I have a need to establish a lexicon for a surgical environment. I
have approximately 20,000,000 lines of text from which I need to
determine unique "words". For the time-being, a "word" is
space-delimited (excluding punctuation, etc). I have done some
searching, but don't see anything obvious as far as pre-existing
scripts for doing this, but figure such a beast must have been created
before. Would anyone have any suggestions? I can create a single file
and read from that file.

Thanks much,

Mike
 
Dr.Ruud

Mike schreef:
I have a need to establish a lexicon for a surgical environment. I
have approximately 20,000,000 lines of text from which I need to
determine unique "words". [...]

No Perl required.

If you put each word on its own line, for instance by using 'tr' to
replace each space and other word-separating character with a \n, then the
file becomes usable for 'sort -u' or 'sort | uniq'.

There might be special characters that you want to keep, such as quote
or hyphen.
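A minimal sketch of that pipeline, assuming the corpus sits in a single file named corpus.txt (the filename is illustrative), and keeping apostrophe and hyphen as suggested:

```shell
# Map every non-word character to a newline (-c complements the set,
# -s squeezes runs of newlines), then keep one copy of each word.
# The set keeps letters, digits, apostrophes, and hyphens.
tr -cs "[:alnum:]'-" '\n' < corpus.txt | sort -u > lexicon.txt
```

For 20 million input lines this stays a single pass through tr plus one external sort, so no Perl is needed for the basic job.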
 
Paul Lalli

Mike said:
I have a need to establish a lexicon for a surgical environment. I
have approximately 20,000,000 lines of text from which I need to
determine unique "words". [...]

If it has, this is not the place to look for it. This newsgroup deals
with helping people write and debug their own scripts, not with finding
preexisting scripts.
Would anyone have any suggestions? I can create a single file
and read from that file.

Check the Perl FAQ:
$ perldoc -q word-frequency
Found in /opt2/Perl5_8_4/lib/perl5/5.8.4/pod/perlfaq6.pod
How can I print out a word-frequency or line-frequency
summary?

Modify that example to instead print out only those words found to have
a frequency of one.

Paul Lalli
 
axel

Mike said:
I have a need to establish a lexicon for a surgical environment. I
have approximately 20,000,000 lines of text from which I need to
determine unique "words". [...]

Going by your specification, it is such a trivial task (about 17 lines
of code for the whole program, including blank lines for readability) that
nobody would bother to save something like that as a standalone script
unless they required repeated use of it.
Would anyone have any suggestions? I can create a single file
and read from that file.

Read in the files line by line, split into words, save individual words
in a hash.

Axel
 
Vilmos Soti

Dr.Ruud said:
No Perl required.

If you put each word on its own line, for instance by using 'tr' to
replace each space and other word-separating character with a \n, then the
file becomes usable for 'sort -u' or 'sort | uniq'.

This is also how I would do it, except for one problem. The OP mentioned
that he has approximately 20 million lines of text. If we suppose that
each line has 10-15 words, then we have 200-300 million lines to
sort. That can easily be 1 GB, and sort will be very slow on that.
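If GNU sort is available, two tweaks make a sort of that size considerably less painful: forcing the C locale (byte comparison instead of locale collation) and enlarging the in-memory buffer. A sketch, with corpus.txt as a placeholder filename and the buffer size a guess to tune for the machine:

```shell
# Same one-word-per-line pipeline, but with byte-order comparison
# (LC_ALL=C) and a 1 GB sort buffer to cut down on temp-file merging.
tr -cs "[:alnum:]'-" '\n' < corpus.txt | LC_ALL=C sort -u -S 1G > lexicon.txt
```

GNU sort also spills to temporary files and merges when the input exceeds the buffer, so the 1 GB figure is a performance knob, not a hard limit.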

Vilmos
 
