How to find a word in wordlist

Discussion in 'Perl Misc' started by VP, Nov 1, 2003.

  1. VP

    VP Guest

    Hi all,
    I have wordlist file having format:

    ....
    abc
    abd
    abf
    abg
    abh
    ....

    So there is one word (utf-8) for each line and all lines were ordered by alphabet.
    I would like to make a Perl script for checking a word:

    # check.pl abc

    if the word is found => ouput OK
    if the word is not found => output two words above and below of this word. For example:
    # check.pl abe => output: abd abf

    The word list has about 70,000 words. I don't know how to search it quickly.

    Any suggestion will be welcome.

    TIA
    VP
     
    VP, Nov 1, 2003
    #1
    1. Advertising

  2. VP <> writes:

    > [...]
    > Any suggestion will be welcome.


    E.g.,

    #!/usr/local/bin/perl -w

    use strict;

    my $s = shift @ARGV or die "Please specify a search term!\n";

    my $prev = '<BEGINNING>';
    while (<>) {
    chomp;
    if ($_ eq $s) {
    print "OK\n";
    exit 0;
    } elsif ($_ le $s) {
    $prev = $_;
    } else {
    print "$prev $_\n";
    exit 0;
    }
    }
    print "$prev <END>\n";

    /Thomas
    --
    Thomas Widmann http://www.twid.bibulus.org
    Flat 3/2, 54 Mavisbank Gardens, Glasgow G51 1HL, Scotland, EU
    *** Ny gruppe om nordiske sprog: europa.linguas.germanic.nord ***
     
    Thomas Widmann, Nov 1, 2003
    #2
    1. Advertising

  3. VP

    VP Guest

    Thomas Widmann wrote:
    > VP <> writes:
    >
    >
    >>[...]
    >>Any suggestion will be welcome.

    >
    >
    > E.g.,
    >
    > #!/usr/local/bin/perl -w
    >
    > use strict;
    >
    > my $s = shift @ARGV or die "Please specify a search term!\n";
    >
    > my $prev = '<BEGINNING>';
    > while (<>) {
    > chomp;
    > if ($_ eq $s) {
    > print "OK\n";
    > exit 0;
    > } elsif ($_ le $s) {
    > $prev = $_;
    > } else {
    > print "$prev $_\n";
    > exit 0;
    > }
    > }
    > print "$prev <END>\n";
    >
    > /Thomas


    thank you for your reply so quickly.
    With your script, if i look for a word "z***" (at the end of my 70,000 words list) so the script
    should parse all terms from a->y, right?
    Is there another method for quick search in this case?

    Thank you,
     
    VP, Nov 1, 2003
    #3
  4. VP wrote:
    > I have wordlist file having format:

    [sorted list]
    > So there is one word (utf-8) for each line and all lines were ordered
    > by alphabet. I would like to make a Perl script for checking a word:


    The typical approach to search for an element in a sorted list is to use
    bisection:
    - look at the middle element in the list
    - is it larger than the word you are looking for? Then search again in
    the lower half of the list.
    - is it smaller than the word you are looking for? Then search again in
    the upper half of the list
    - is it the word itself? Then you are done
    - are there no elements left in the list? Sorry, word does not exist in
    the list.

    Usually you would keep the list in a global array (used read-only) and the
    boundaries for the active still-to-be-searched list are two indices into the
    array.
    Then it is just a matter of iterating and changing the upper or the lower
    index until either the word is found or the upper index is smaller than the
    lower index. BTW: In the later case you automatically got the indeces for
    the two neighbour elements of your word.

    For details you may want to check any book about algorithms.

    jue
     
    Jürgen Exner, Nov 1, 2003
    #4
  5. -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1

    Thomas Widmann <> wrote in news::

    > VP <> writes:
    >
    > my $prev = '<BEGINNING>';
    > while (<>) {
    > chomp;
    > if ($_ eq $s) {
    > print "OK\n";
    > exit 0;
    > } elsif ($_ le $s) {
    > $prev = $_;
    > } else {
    > print "$prev $_\n";
    > exit 0;
    > }
    > }
    > print "$prev <END>\n";


    The OP did say he wanted to search the list "quickly". A linear search is
    not quick.

    - --
    Eric
    $_ = reverse sort $ /. r , qw p ekca lre uJ reh
    ts p , map $ _. $ " , qw e p h tona e and print

    -----BEGIN PGP SIGNATURE-----
    Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

    iQA/AwUBP6Ou7GPeouIeTNHoEQIyUgCgqT+fk9VKI8VRVLEf5VQxMnXhoK0AoPMS
    p1vafsNNNkKSSxypMrpf+mNx
    =/7fK
    -----END PGP SIGNATURE-----
     
    Eric J. Roode, Nov 1, 2003
    #5
  6. -----BEGIN PGP SIGNED MESSAGE-----
    Hash: SHA1

    VP <> wrote in
    news:3fa37c8c$0$1854$:

    > Hi all,
    > I have wordlist file having format:
    >
    > ...
    > abc
    > abd
    > abf
    > abg
    > abh
    > ...
    >
    > So there is one word (utf-8) for each line and all lines were ordered
    > by alphabet. I would like to make a Perl script for checking a word:
    >
    > # check.pl abc
    >
    > if the word is found => ouput OK
    > if the word is not found => output two words above and below of this
    > word. For example: # check.pl abe => output: abd abf
    >
    > The word list has about 70,000 words. I don't know how to search it
    > quickly.
    >
    > Any suggestion will be welcome.


    The "classic" way to do this is an algorithm called a "binary search".
    You locate the middle record, and then move up or down in the file
    depending on whether the record you just read is less than or greater
    than your search string. Then you move to the record 1/4 of the way
    though the file or 3/4 of the way through the file, and so on, each time
    dividing the search space by two. In this way, a list of 70,000 items
    can be searched in a maximum of 17 lookups.

    In Perl, if memory is not a concern (and with 70,000 items, it probably
    will be), you could read the entire file into a hash. This takes lots
    of memory, but thereafter, lookups are very fast. Also, the time to load
    the data into the hash may be significant, so this method is only useful
    if you plan to do many word lookups during a single run of the program.

    Probably the best all-around solution would be to store the wordlist into
    a database of some sort; perhaps a DBM file. Then you could tie a hash
    to the DBM file, giving you the speed of a hash lookup without the
    overhead of loading it into memory each run of your program.

    HTH,

    - --
    Eric
    $_ = reverse sort $ /. r , qw p ekca lre uJ reh
    ts p , map $ _. $ " , qw e p h tona e and print

    -----BEGIN PGP SIGNATURE-----
    Version: PGPfreeware 7.0.3 for non-commercial use <http://www.pgp.com>

    iQA+AwUBP6Ow92PeouIeTNHoEQJt3QCfQ7PEVdwE3hB08a7D9X4jvMRVVNoAl20G
    dXKCt1eJXlB9Kmd8SbsO2fM=
    =YQKv
    -----END PGP SIGNATURE-----
     
    Eric J. Roode, Nov 1, 2003
    #6
  7. VP

    Anno Siegel Guest

    VP <> wrote in comp.lang.perl.misc:
    > Hi all,
    > I have wordlist file having format:
    >
    > ...
    > abc
    > abd
    > abf
    > abg
    > abh
    > ...
    >
    > So there is one word (utf-8) for each line and all lines were ordered by
    > alphabet.
    > I would like to make a Perl script for checking a word:
    >
    > # check.pl abc
    >
    > if the word is found => ouput OK
    > if the word is not found => output two words above and below of this
    > word. For example:
    > # check.pl abe => output: abd abf
    >
    > The word list has about 70,000 words. I don't know how to search it quickly.


    As Juergen Exner has noted, this practically shouts "binary search". The
    salient point is that you want the neighboring words in case of a mismatch.
    This is the one thing binary search does better than hash searching: In
    binary search you are left with an approximation when the search fails,
    with a hash you know nothing.

    The direct application of binary search requires the whole file to be read
    into memory. With 70,000 words this should be no problem.

    With a large file, a space-time tradeoff is possible by storing only a
    fraction (every tenth, hundredth) of the words together with their
    position in the file. A single pass over the file with tell() will
    create that index. The binary search goes only over the words in memory
    and finds the section of the file where the word must be, if it is in
    the file. A linear search over ten (a hundred) subsequent entries
    decides this.

    Anno
     
    Anno Siegel, Nov 1, 2003
    #7
  8. VP <> writes:

    > With your script, if i look for a word "z***" (at the end of my 70,000
    > words list) so the script should parse all terms from a->y, right?


    Right. 70.000 lines doesn't take that long to process on my computer,
    though (but YMMV).

    > Is there another method for quick search in this case?


    It depends a bit. In your usage example, it seemed that you would run
    it from the command-line with just one search term. That makes it a
    bit complex to speed up. (If you'd use the program to make a lot of
    queries within the same run, it'd be trivial to read the list into an
    array and then use binary search as suggested by other people
    elsewhere in this thread.)

    If you want to have a fast look-up tool on the commandline, you have
    several options.

    1) Instead of reading the wordlist sequentially, you might try to use
    random access to the file. Basically, you'd jump right into the
    middle of the file, search forward to the next newline character,
    read a line, compare it to your search term, and depending on the
    result jump to somewhere else. However, this kind of low-level
    file access is not really Perl's strong side.

    2) You can preprocess the file into something than can be read a lot
    faster.

    3) You can write a client/server application, where your server reads
    the wordlist once and then waits for client queries.

    4) As (3), but using an database as the server, as somebody else
    suggested.

    For (2)-(4), you have to make additional checks to find out whether
    the wordlist has changed since it was last read.

    /Thomas
    --
    Thomas Widmann http://www.twid.bibulus.org
    Flat 3/2, 54 Mavisbank Gardens, Glasgow G51 1HL, Scotland, EU
    *** Ny gruppe om nordiske sprog: europa.linguas.germanic.nord ***
     
    Thomas Widmann, Nov 1, 2003
    #8
  9. VP

    Guest

    VP <> wrote:
    > Thomas Widmann wrote:
    > > VP <> writes:
    > >
    > >
    > >>[...]
    > >>Any suggestion will be welcome.

    > >
    > >
    > > E.g.,
    > >
    > > #!/usr/local/bin/perl -w
    > >
    > > use strict;
    > >
    > > my $s = shift @ARGV or die "Please specify a search term!\n";
    > >
    > > my $prev = '<BEGINNING>';
    > > while (<>) {
    > > chomp;
    > > if ($_ eq $s) {
    > > print "OK\n";
    > > exit 0;
    > > } elsif ($_ le $s) {
    > > $prev = $_;
    > > } else {
    > > print "$prev $_\n";
    > > exit 0;
    > > }
    > > }
    > > print "$prev <END>\n";
    > >
    > > /Thomas

    >
    > thank you for your reply so quickly.
    > With your script, if i look for a word "z***" (at the end of my 70,000
    > words list) so the script should parse all terms from a->y, right?
    > Is there another method for quick search in this case?


    What, a tenth of a second isn't fast enough for you?

    Since your program is being started afresh each time, it is going
    to have to re-read the list each time. You can't do a binary search
    (or hash lookup) on the data unless you actually have the data.

    A simple binary search will require the entire file to be slurped into
    memory. On my system, just slurping the file into an array
    (in preparation for a binary search, but without actually doing the
    binary search) takes 66% more time the worst-case performance of the linear
    search.


    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service New Rate! $9.95/Month 50GB
     
    , Nov 1, 2003
    #9
  10. VP

    Brad Baxter Guest

    On Sat, 1 Nov 2003, Anno Siegel wrote:

    > VP <> wrote in comp.lang.perl.misc:
    > > Hi all,
    > > I have wordlist file having format:
    > > ...
    > > abc
    > > abd
    > > abf
    > > abg
    > > abh
    > > ...
    > > # check.pl abc
    > >
    > > if the word is found => ouput OK
    > > if the word is not found => output two words above and below of this
    > > word. For example:
    > > # check.pl abe => output: abd abf

    [snip]
    >
    > The direct application of binary search requires the whole file to be read
    > into memory. With 70,000 words this should be no problem.
    >


    Anything wrong with using Search::Dict? And perhaps something like
    File::ReadBackwards for the not-found condition (not included below)?


    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use Search::Dict;

    my $term = shift||die "Usage: $0 term\n";

    open my $fh, 'datafile' or die $!;
    look $fh, $term;

    my $try = <$fh>||'';
    chomp $try;

    my $found = $term eq $try ? 1 : 0;
    print $found?"OK\n":"Not found\n";


    Regards,

    Brad
     
    Brad Baxter, Nov 3, 2003
    #10
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Wybo Dekker
    Replies:
    1
    Views:
    395
    Yukihiro Matsumoto
    Nov 15, 2005
  2. Mohit Sindhwani
    Replies:
    5
    Views:
    212
    Mohit Sindhwani
    Oct 27, 2008
  3. Dr.Ruud
    Replies:
    1
    Views:
    172
    Dr.Ruud
    Apr 1, 2006
  4. vdvorkin
    Replies:
    0
    Views:
    437
    vdvorkin
    Feb 10, 2011
  5. TheDigitalroot

    I wanna make wordlist cutter

    TheDigitalroot, Mar 30, 2013, in forum: Python
    Replies:
    5
    Views:
    138
    TheDigitalroot
    Mar 30, 2013
Loading...

Share This Page