How to find a word in wordlist

V

VP

Hi all,
I have wordlist file having format:

....
abc
abd
abf
abg
abh
....

So there is one word (utf-8) for each line and all lines were ordered by alphabet.
I would like to make a Perl script for checking a word:

# check.pl abc

if the word is found => ouput OK
if the word is not found => output two words above and below of this word. For example:
# check.pl abe => output: abd abf

The word list has about 70,000 words. I don't know how to search it quickly.

Any suggestion will be welcome.

TIA
VP
 
T

Thomas Widmann

VP said:
[...]
Any suggestion will be welcome.

E.g.,

#!/usr/local/bin/perl -w

use strict;

my $s = shift @ARGV or die "Please specify a search term!\n";

my $prev = '<BEGINNING>';
while (<>) {
chomp;
if ($_ eq $s) {
print "OK\n";
exit 0;
} elsif ($_ le $s) {
$prev = $_;
} else {
print "$prev $_\n";
exit 0;
}
}
print "$prev <END>\n";

/Thomas
 
V

VP

Thomas said:
[...]
Any suggestion will be welcome.


E.g.,

#!/usr/local/bin/perl -w

use strict;

my $s = shift @ARGV or die "Please specify a search term!\n";

my $prev = '<BEGINNING>';
while (<>) {
chomp;
if ($_ eq $s) {
print "OK\n";
exit 0;
} elsif ($_ le $s) {
$prev = $_;
} else {
print "$prev $_\n";
exit 0;
}
}
print "$prev <END>\n";

/Thomas

thank you for your reply so quickly.
With your script, if i look for a word "z***" (at the end of my 70,000 words list) so the script
should parse all terms from a->y, right?
Is there another method for quick search in this case?

Thank you,
 
J

Jürgen Exner

VP said:
I have wordlist file having format: [sorted list]
So there is one word (utf-8) for each line and all lines were ordered
by alphabet. I would like to make a Perl script for checking a word:

The typical approach to search for an element in a sorted list is to use
bisection:
- look at the middle element in the list
- is it larger than the word you are looking for? Then search again in
the lower half of the list.
- is it smaller than the word you are looking for? Then search again in
the upper half of the list
- is it the word itself? Then you are done
- are there no elements left in the list? Sorry, word does not exist in
the list.

Usually you would keep the list in a global array (used read-only) and the
boundaries for the active still-to-be-searched list are two indices into the
array.
Then it is just a matter of iterating and changing the upper or the lower
index until either the word is found or the upper index is smaller than the
lower index. BTW: In the later case you automatically got the indeces for
the two neighbour elements of your word.

For details you may want to check any book about algorithms.

jue
 
E

Eric J. Roode

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Thomas Widmann said:
my $prev = '<BEGINNING>';
while (<>) {
chomp;
if ($_ eq $s) {
print "OK\n";
exit 0;
} elsif ($_ le $s) {
$prev = $_;
} else {
print "$prev $_\n";
exit 0;
}
}
print "$prev <END>\n";

The OP did say he wanted to search the list "quickly". A linear search is
not quick.

- --
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

iQA/AwUBP6Ou7GPeouIeTNHoEQIyUgCgqT+fk9VKI8VRVLEf5VQxMnXhoK0AoPMS
p1vafsNNNkKSSxypMrpf+mNx
=/7fK
-----END PGP SIGNATURE-----
 
E

Eric J. Roode

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hi all,
I have wordlist file having format:

...
abc
abd
abf
abg
abh
...

So there is one word (utf-8) for each line and all lines were ordered
by alphabet. I would like to make a Perl script for checking a word:

# check.pl abc

if the word is found => ouput OK
if the word is not found => output two words above and below of this
word. For example: # check.pl abe => output: abd abf

The word list has about 70,000 words. I don't know how to search it
quickly.

Any suggestion will be welcome.

The "classic" way to do this is an algorithm called a "binary search".
You locate the middle record, and then move up or down in the file
depending on whether the record you just read is less than or greater
than your search string. Then you move to the record 1/4 of the way
though the file or 3/4 of the way through the file, and so on, each time
dividing the search space by two. In this way, a list of 70,000 items
can be searched in a maximum of 17 lookups.

In Perl, if memory is not a concern (and with 70,000 items, it probably
will be), you could read the entire file into a hash. This takes lots
of memory, but thereafter, lookups are very fast. Also, the time to load
the data into the hash may be significant, so this method is only useful
if you plan to do many word lookups during a single run of the program.

Probably the best all-around solution would be to store the wordlist into
a database of some sort; perhaps a DBM file. Then you could tie a hash
to the DBM file, giving you the speed of a hash lookup without the
overhead of loading it into memory each run of your program.

HTH,

- --
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

-----BEGIN PGP SIGNATURE-----
Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

iQA+AwUBP6Ow92PeouIeTNHoEQJt3QCfQ7PEVdwE3hB08a7D9X4jvMRVVNoAl20G
dXKCt1eJXlB9Kmd8SbsO2fM=
=YQKv
-----END PGP SIGNATURE-----
 
A

Anno Siegel

VP said:
Hi all,
I have wordlist file having format:

...
abc
abd
abf
abg
abh
...

So there is one word (utf-8) for each line and all lines were ordered by
alphabet.
I would like to make a Perl script for checking a word:

# check.pl abc

if the word is found => ouput OK
if the word is not found => output two words above and below of this
word. For example:
# check.pl abe => output: abd abf

The word list has about 70,000 words. I don't know how to search it quickly.

As Juergen Exner has noted, this practically shouts "binary search". The
salient point is that you want the neighboring words in case of a mismatch.
This is the one thing binary search does better than hash searching: In
binary search you are left with an approximation when the search fails,
with a hash you know nothing.

The direct application of binary search requires the whole file to be read
into memory. With 70,000 words this should be no problem.

With a large file, a space-time tradeoff is possible by storing only a
fraction (every tenth, hundredth) of the words together with their
position in the file. A single pass over the file with tell() will
create that index. The binary search goes only over the words in memory
and finds the section of the file where the word must be, if it is in
the file. A linear search over ten (a hundred) subsequent entries
decides this.

Anno
 
T

Thomas Widmann

VP said:
With your script, if i look for a word "z***" (at the end of my 70,000
words list) so the script should parse all terms from a->y, right?

Right. 70.000 lines doesn't take that long to process on my computer,
though (but YMMV).
Is there another method for quick search in this case?

It depends a bit. In your usage example, it seemed that you would run
it from the command-line with just one search term. That makes it a
bit complex to speed up. (If you'd use the program to make a lot of
queries within the same run, it'd be trivial to read the list into an
array and then use binary search as suggested by other people
elsewhere in this thread.)

If you want to have a fast look-up tool on the commandline, you have
several options.

1) Instead of reading the wordlist sequentially, you might try to use
random access to the file. Basically, you'd jump right into the
middle of the file, search forward to the next newline character,
read a line, compare it to your search term, and depending on the
result jump to somewhere else. However, this kind of low-level
file access is not really Perl's strong side.

2) You can preprocess the file into something than can be read a lot
faster.

3) You can write a client/server application, where your server reads
the wordlist once and then waits for client queries.

4) As (3), but using an database as the server, as somebody else
suggested.

For (2)-(4), you have to make additional checks to find out whether
the wordlist has changed since it was last read.

/Thomas
 
C

ctcgag

VP said:
Thomas said:
[...]
Any suggestion will be welcome.


E.g.,

#!/usr/local/bin/perl -w

use strict;

my $s = shift @ARGV or die "Please specify a search term!\n";

my $prev = '<BEGINNING>';
while (<>) {
chomp;
if ($_ eq $s) {
print "OK\n";
exit 0;
} elsif ($_ le $s) {
$prev = $_;
} else {
print "$prev $_\n";
exit 0;
}
}
print "$prev <END>\n";

/Thomas

thank you for your reply so quickly.
With your script, if i look for a word "z***" (at the end of my 70,000
words list) so the script should parse all terms from a->y, right?
Is there another method for quick search in this case?

What, a tenth of a second isn't fast enough for you?

Since your program is being started afresh each time, it is going
to have to re-read the list each time. You can't do a binary search
(or hash lookup) on the data unless you actually have the data.

A simple binary search will require the entire file to be slurped into
memory. On my system, just slurping the file into an array
(in preparation for a binary search, but without actually doing the
binary search) takes 66% more time the worst-case performance of the linear
search.


Xho
 
B

Brad Baxter

VP said:
Hi all,
I have wordlist file having format:
...
abc
abd
abf
abg
abh
...
# check.pl abc

if the word is found => ouput OK
if the word is not found => output two words above and below of this
word. For example:
# check.pl abe => output: abd abf
[snip]

The direct application of binary search requires the whole file to be read
into memory. With 70,000 words this should be no problem.

Anything wrong with using Search::Dict? And perhaps something like
File::ReadBackwards for the not-found condition (not included below)?


#!/usr/local/bin/perl
use strict;
use warnings;
use Search::Dict;

my $term = shift||die "Usage: $0 term\n";

open my $fh, 'datafile' or die $!;
look $fh, $term;

my $try = <$fh>||'';
chomp $try;

my $found = $term eq $try ? 1 : 0;
print $found?"OK\n":"Not found\n";


Regards,

Brad
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,576
Members
45,054
Latest member
LucyCarper

Latest Threads

Top