Script to find unique words in a document

Mike

I have a need to establish a lexicon for a surgical environment. I
have approximately 20,000,000 lines of text from which I need to
determine unique "words". For the time-being, a "word" is
space-delimited (excluding punctuation, etc). I have done some
searching, but don't see anything obvious as far as pre-existing
scripts for doing this, but figure such a beast must have been created
before. Would anyone have any suggestions? I can create a single file
and read from that file.

Thanks much,

Mike
 
Dr.Ruud

Mike schreef:
I have a need to establish a lexicon for a surgical environment. I
have approximately 20,000,000 lines of text from which I need to
determine unique "words". [...]

No Perl required.

If you put each word on its own line, for instance by using 'tr' to
replace each space and other word-separating character with a \n, then the
file becomes usable for 'sort -u' or 'sort | uniq'.

There might be special characters that you want to keep, such as quote
or hyphen.
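A minimal sketch of that pipeline, assuming the corpus sits in a single file named corpus.txt (the filename is illustrative), and keeping apostrophe and hyphen as suggested:

```shell
# Map every non-word character to a newline (-c complements the set,
# -s squeezes runs of newlines), then keep one copy of each word.
# The set keeps letters, digits, apostrophes, and hyphens.
tr -cs "[:alnum:]'-" '\n' < corpus.txt | sort -u > lexicon.txt
```

For 20 million input lines this stays a single pass through tr plus one external sort, so no Perl is needed for the basic job.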
 
Paul Lalli

Mike said:
I have a need to establish a lexicon for a surgical environment. I
have approximately 20,000,000 lines of text from which I need to
determine unique "words". [...]

If it has, this is not the place to look for it. This newsgroup deals
with helping people write and debug their own scripts, not with finding
preexisting scripts.
Would anyone have any suggestions? I can create a single file
and read from that file.

Check the Perl FAQ:
$ perldoc -q word-frequency
Found in /opt2/Perl5_8_4/lib/perl5/5.8.4/pod/perlfaq6.pod
How can I print out a word-frequency or line-frequency
summary?

Modify that example to instead print out only those words found to have
a frequency of one.

Paul Lalli
 
axel

Mike said:
I have a need to establish a lexicon for a surgical environment. I
have approximately 20,000,000 lines of text from which I need to
determine unique "words". [...]

Going by your specification, it is such a trivial task (about 17 lines
of code for the whole program, including blank lines for readability) that
nobody would bother to save something like that as a standalone script
unless they required repeated use of it.
Would anyone have any suggestions? I can create a single file
and read from that file.

Read in the files line by line, split into words, save individual words
in a hash.

Axel
 
Vilmos Soti

Dr.Ruud said:
No Perl required.

If you put each word on its own line, for instance by using 'tr' to
replace each space and other word-separating character with a \n, then the
file becomes usable for 'sort -u' or 'sort | uniq'.

This is also how I would do it, except for one problem. The OP mentioned
that he has approximately 20 million lines of text. If we suppose that
each line has 10-15 words, then we have 200-300 million lines to
sort. That can easily be 1 GB, and sort will be very slow on that.
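If GNU sort is available, two tweaks make a sort of that size considerably less painful: forcing the C locale (byte comparison instead of locale collation) and enlarging the in-memory buffer. A sketch, with corpus.txt as a placeholder filename and the buffer size a guess to tune for the machine:

```shell
# Same one-word-per-line pipeline, but with byte-order comparison
# (LC_ALL=C) and a 1 GB sort buffer to cut down on temp-file merging.
tr -cs "[:alnum:]'-" '\n' < corpus.txt | LC_ALL=C sort -u -S 1G > lexicon.txt
```

GNU sort also spills to temporary files and merges when the input exceeds the buffer, so the 1 GB figure is a performance knob, not a hard limit.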

Vilmos
 
