Vector Space Search Engine

Discussion in 'Perl Misc' started by babydoe@mailinator.com, Oct 11, 2005.

  1. Guest

    These are notes to myself, and anyone else
    having trouble with the article at:
    'http://www.perl.com/pub/a/2003/02/19/engine.html'.

    No mass public feels at ease with electronic privacy.
    But marketing is very much at ease at invading our
    privacy, and marketing has no particular concern with
    truth. So we ought to be using privacy to address
    lovers, postmen, children and pets. We are not, so far.

    Privacy concerns with Google Desktop, which is just not
    de rigueur, made me look for a replacement: I found
    Perl, and in particular the Perl distribution from
    'ActiveState', 'http://www.activestate.com'.

    With Perl installed, you can roll your own search
    engine, and unlike Mr Creepy Fuckin Google's search
    engine, this engine does not go online to index
    anything. It does exactly what it should do (or what it
    should do if it worked, because like everything with
    Perl, things almost work, but not quite).

    You will need extra Perl modules for your search
    engine: 'Lingua-Stem', which you can get from the
    'ActiveState' central repository by running the command
    'c:\>ppm i Lingua-Stem', and also the 'pdl win32
    binaries', named 'PDL-2.4.1-win32-4.zip'. Links to the
    binaries are available from 'http://pdl.perl.org/'.
    Unzip these files and run the batch file
    'install-pdl.bat'.
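
    A quick way to check that both installs took is sketched
    below. It assumes the modules load under the names
    'Lingua::Stem' and 'PDL'; the helper name
    'module_available' is made up for this note and is not
    part of any distribution.

```perl
use strict;
use warnings;

# Return 1 if the named module can be loaded, 0 otherwise.
# The name is converted to a file path (Foo::Bar -> Foo/Bar.pm)
# so that require can be given a runtime string.
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

# Assumed module names for the two packages installed above.
for my $module (qw(Lingua::Stem PDL)) {
    printf "%-12s %s\n", $module,
        module_available($module) ? 'ok' : 'MISSING';
}
```

    A module that is present but fails to compile is also
    reported as missing, which is usually what you want to
    know before running the search engine.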

    Download sample code, 'Listing 1, VectorSpace.pm,' from
    http://www.perl.com/2003/02/19/examples/VectorSpace.pm
    and install in the directory, 'c:/Perl/Lib/Search.' The
    'VectorSpace.pm' does not work because of the way Perl
    handles record separators. You need to comment out the
    subroutine 'load_stop_list' in 'VectorSpace.pm', and
    replace with the following subroutine.

    --%<-----%<----first patch for VectorSpace.pm-----%<--
    =item load_stop_list

    Hacked by me, because, as written, with record separator
    $\ = undef, the entire stop list was slurped up into
    one key. Now the hash performs as it should, with each
    stop_word being a separate record

    =cut

    sub load_stop_list {
        $_ = <DATA>;
        chomp(my @stop_words = split);
        my %stop_words;
        $stop_words{$_}++ for @stop_words;
        return \%stop_words;
    }
    --%<----%<-----%<----%<-----%<----%<-----%<----%<------

    And there is one other error (blatant thank goodness),
    which will give the following warning, "Use of
    uninitialized value in subroutine entry at
    c:/perl/lib//Search/VectorSpace.pm line 175, <DATA>
    chunk 1." You need to hack into 'VectorSpace.pm' and
    search for the line:

    @lookup{@sorted_words} = (1..$#sorted_words );

    Funny, but to me, that error sticks out like a
    foreskin at a Jewish wedding; of course, replace it
    with the line:

    @lookup{@sorted_words} = (0..$#sorted_words );
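
    Why the one-character change matters can be seen in a
    small stand-alone sketch. A hash slice assigns values to
    keys position by position, so a range that starts at 1
    supplies one value too few and the last word ends up with
    an undefined index. The helper 'build_lookup' and the
    three sample words are invented for this illustration.

```perl
use strict;
use warnings;

# Build a word => index lookup with a hash slice.
# $first is the first index handed out (0 or 1).
sub build_lookup {
    my ($first, @sorted_words) = @_;
    my %lookup;
    @lookup{@sorted_words} = ( $first .. $#sorted_words );
    return \%lookup;
}

my @words = qw(apple banana cherry);

my $buggy = build_lookup(1, @words);   # 1..2: two values for three keys
my $fixed = build_lookup(0, @words);   # 0..2: one value per key

for my $word (@words) {
    printf "%-7s buggy=%-5s fixed=%d\n",
        $word,
        defined $buggy->{$word} ? $buggy->{$word} : 'undef',
        $fixed->{$word};
}
```

    With the range starting at 1, 'cherry' exists as a key
    but has no value, which is the kind of undefined value
    the warning complains about.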

    The article at
    'http://www.perl.com/pub/a/2003/02/19/engine.html' is
    incomplete, in that it never explains how to run the engine.
    (Why are scientists like that? always leaving it to the
    candy man, in his bow chicka bow bow purple velvet pimp
    suit and hat, to make a practical application.)

    For my practical application, I have been given an
    orphaned quote, "The taxi moves off slowly, the man
    still not having said a word to the driver," and I
    want to find which document, in my 'eBooks' directory,
    the quote came from. With the 'VectorSpace' Perl module,
    now I can.

    I type the sentence into a file 'Quote.txt' and save it
    to my desktop.

    -----%<----%<-----%<----Quote.txt-----%<----%<----%<--
    The taxi moves off slowly, the man still not having said
    a word to the driver.
    -----%<----%<-----%<----%<-----%<----%<-----%<----%<--

    I also save the following script to my desktop;
    Arrrrrrrrgh, even to me, who wrote this script, it is a
    mess, with Perl magic everywhere. The theory is simple;
    I am drilling into my eBooks directory and finding the
    closest match to the words in 'Quote.txt.'

    -----%<----%<-----%<----searchBooks.pl-----%<----%<---
    #!perl
    #
    use warnings;
    use strict;
    use File::Glob ':glob';
    use Search::VectorSpace;
    use File::Temp qw/ tempfile tempdir /;
    #
    local $/ = undef;
    my $homedir = $ENV{'USERPROFILE'}."/My Documents/eBooks";
    my @files = <$homedir/*>;
    @files = grep -f, @files;
    my @docs;
    for ( 0 .. $#files ) {
    open my $fh, "$files[$_]"
    or die "cannot open file $files[$_]: $!";
    $docs[$_] = <$fh>;
    }
    #
    my $engine = Search::VectorSpace->new( docs => \@docs,
    threshold => 0.04 );
    $engine->build_index();
    #
    my $query = <>;
    my %results = $engine->search($query);
    my ($fh, $filename) = tempfile(SUFFIX => '.html');
    foreach my $result (
    sort { $results{$b} <=> $results{$a} }
    keys %results
    )
    {
    print "Relevance: ", $results{$result}, "\n";
    print $fh $result, "\n\n"; close $fh;
    exec $filename;
    }
    -----%<----%<-----%<----%<-----%<----%<-----%<----%<--
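
    The "closest match" idea behind the script above can be
    sketched without any extra modules. This is a simplified
    illustration, not the code in 'VectorSpace.pm': it uses
    plain word counts in Perl hashes, with no stemming and no
    stop list, and the names 'term_vector' and 'cosine' are
    made up here. Each text becomes a vector of word counts,
    and two texts are compared by the cosine of the angle
    between their vectors.

```perl
use strict;
use warnings;

# Turn a text into a hash of lower-cased word => count.
sub term_vector {
    my ($text) = @_;
    my %count;
    $count{ lc $1 }++ while $text =~ /(\w+)/g;
    return \%count;
}

# Cosine of the angle between two term vectors:
# dot(a, b) / (|a| * |b|), or 0 if either vector is empty.
sub cosine {
    my ($vec_a, $vec_b) = @_;
    my ($dot, $norm_a, $norm_b) = (0, 0, 0);
    $norm_a += $_ ** 2 for values %$vec_a;
    $norm_b += $_ ** 2 for values %$vec_b;
    for my $word (keys %$vec_a) {
        $dot += $vec_a->{$word} * $vec_b->{$word}
            if exists $vec_b->{$word};
    }
    return 0 unless $norm_a && $norm_b;
    return $dot / sqrt($norm_a * $norm_b);
}

# Two invented sample documents and a query.
my $query = term_vector('the taxi moves off slowly');
my %docs  = (
    taxi   => 'The taxi moves off slowly, the man still silent.',
    garden => 'Roses grow slowly in the garden.',
);

for my $name (sort keys %docs) {
    printf "%-6s %.3f\n", $name,
        cosine($query, term_vector($docs{$name}));
}
```

    A score near 1 means the two texts use the same words in
    similar proportions, and a score of 0 means they share no
    words at all. The 'threshold' argument in the script plays
    the role of a cut-off on this kind of score.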

    From the command line I type


    c:\Documents and Settings\Nomen Nescio\Desktop>
    perl searchBooks.pl Quote.txt

    And Ta-Dum, displayed in my Internet browser is the
    ebook "The Story of O, by Pauline Reage," an ebook,
    incidentally, I accidentally downloaded from the #ebooks
    channel on IRC. (Note to self: better put encryption on
    that book.)
     
    , Oct 11, 2005
    #1

  2. wrote in news:1129018494.679490.52800
    @g44g2000cwa.googlegroups.com:

    > With Perl installed, you can roll your own search
    > engine, and unlike Mr Creepy Fuckin Google's search


    Upon reading this, I went ahead and added you to my killfile. However, I
    had already started commenting on your code, so, here they are:

    > The 'VectorSpace.pm' does not work because of the way Perl
    > handles record separators. You need to comment out the
    > subroutine 'load_stop_list' in 'VectorSpace.pm', and
    > replace with the following subroutine.


    This is misleading.

    > --%<-----%<----first patch for VectorSpace.pm-----%<--
    > =item load_stop_list
    >
    > Hacked by me, because, as written, with record separator
    > $\ = undef, the entire stop list was slurped up into
    > one key.


    The problem stems from you slapping a

    local $\;

    at the top of your program. (You also set it to undef, indicating you do
    not understand how local works).

    You should restrict changes from default behavior to the smallest
    possible scope.

    [ more drivel laced with profanity snipped ]

    > -----%<----%<-----%<----searchBooks.pl-----%<----%<---
    > #!perl
    > #
    > use warnings;
    > use strict;
    > use File::Glob ':glob';
    > use Search::VectorSpace;
    > use File::Temp qw/ tempfile tempdir /;
    > #
    > local $/ = undef;


    This should not be here.

    > my $homedir = $ENV{'USERPROFILE'}."/My Documents/eBooks";
    > my @files = <$homedir/*>;
    > @files = grep -f, @files;
    > my @docs;
    > for ( 0 .. $#files ) {
    > open my $fh, "$files[$_]"


    Useless use of quotes.

    > or die "cannot open file $files[$_]: $!";


    local $\;

    > $docs[$_] = <$fh>;
    > }


    You should put the

    local $\;

    in the body of the for loop

    ....

    > my ($fh, $filename) = tempfile(SUFFIX => '.html');


    ....

    > print "Relevance: ", $results{$result}, "\n";
    > print $fh $result, "\n\n"; close $fh;


    You are writing plain text to an html file. Newlines won't help you
    display it the way you seem to want.
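
    One possible way to address this, sketched with invented
    helper names ('escape_html', 'text_to_html') and no extra
    modules: escape the characters that are special in HTML
    and wrap the text in a 'pre' element so the browser keeps
    the line breaks.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML.
sub escape_html {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
    );
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Wrap plain text in a minimal HTML page; <pre> preserves newlines.
sub text_to_html {
    my ($title, $text) = @_;
    return join "\n",
        '<html><head><title>' . escape_html($title) . '</title></head>',
        '<body><pre>' . escape_html($text) . '</pre></body></html>',
        '';
}

print text_to_html( 'Best match', "line one\nline <two> & more\n" );
```

    In the posted script, the string returned by a function
    like this would be what gets printed to the temporary
    file instead of the raw document text.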

    Bye.

    Sinan
    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Oct 11, 2005
    #2

  3. <> wrote:

    > Mr Creepy Fuckin Google's search
    > engine,



    This is a family newsgroup.

    Please attempt to develop a richer vocabulary so you won't have
    to resort to vulgarity as a placeholder for something meaningful.


    > Hacked by me, because, as written, with record separator
    > $\ = undef, the entire stop list was slurped up into
    > one key.



    I seriously doubt that the *output* record separator has
    an effect on *input* ...
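
    The point can be checked directly. In the sketch below
    (the helper 'count_records' and the sample text are made
    up for illustration), the same three-line text is read
    under different settings of $/ (input record separator)
    and $\ (output record separator). Only a change to $/
    alters how many records come back from a read.

```perl
use strict;
use warnings;

# Count how many records <$fh> returns for $text when $/ and $\
# are set to the given values for the duration of the read.
sub count_records {
    my ($text, $input_sep, $output_sep) = @_;
    local $/ = $input_sep;
    local $\ = $output_sep;
    open my $fh, '<', \$text
        or die "cannot open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    return scalar @records;
}

my $text  = "alpha\nbeta\ngamma\n";
my @cases = (
    [ 'defaults',          "\n",  undef  ],
    [ 'only $\ changed',   "\n",  "--\n" ],
    [ 'only $/ undefined', undef, undef  ],
);

for my $case (@cases) {
    my ($label, $in, $out) = @$case;
    printf "%-20s %d record(s)\n",
        $label, count_records($text, $in, $out);
}
```

    The first two cases read three records each; the third
    reads the whole text as a single record, which is the
    slurping behaviour described in the original post.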


    > And there is one other error (blatant thank goodness),
    > which will give the following warning, "Use of
    > uninitialized value



    That is not an error message. It is a warning message.


    > open my $fh, "$files[$_]"



    perldoc -q vars
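
    For context, that FAQ entry is about needlessly quoting
    variables. In the 'open' call the quotes only make a copy
    of a string, so they are harmless but pointless. They
    start to matter when the variable holds something other
    than a string, as in this small sketch (the helper
    'count_items' is invented for the example):

```perl
use strict;
use warnings;

# Return the number of elements if given an array reference,
# or -1 if the argument is not an array reference.
sub count_items {
    my ($arg) = @_;
    return ref($arg) eq 'ARRAY' ? scalar @$arg : -1;
}

my @files = ('a.txt', 'b.txt');
my $list  = \@files;

print count_items($list),   "\n";   # reference passed through intact
print count_items("$list"), "\n";   # quotes turned it into a plain string
```

    The second call no longer receives a reference, only its
    string form, so the array behind it is out of reach.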


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Oct 11, 2005
    #3
  4. "A. Sinan Unur" <> wrote in
    news:Xns96EC5120845A8asu1cornelledu@127.0.0.1:

    > local $\;


    This should have been:

    local $/;

    as pointed out by Tad.
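
    A minimal sketch of what the scoped version could look
    like, assuming a helper sub rather than the loop body
    itself (the name 'slurp_file' is made up here). Because
    'local $/;' sits inside the sub, the rest of the program,
    including any module that reads its own DATA section
    line by line, still sees the default separator.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole file into one string. $/ is undefined only inside
# this sub; the previous value is restored when the sub returns.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path
        or die "cannot open file $path: $!";
    local $/;                      # undef in this scope only
    my $content = <$fh>;
    close $fh;
    return defined $content ? $content : '';
}

# Demo: write a three-line temp file, slurp it, then check $/.
my ($out, $path) = tempfile( UNLINK => 1 );
print {$out} "one\ntwo\nthree\n";
close $out;

my $content = slurp_file($path);
printf "read %d characters in one go\n", length $content;
print  "input separator is still a newline outside the sub\n"
    if $/ eq "\n";
```

    Putting the same 'local $/;' line at the top of the for
    loop body would have the same effect for that loop.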

    Arrrgh!

    Sinan

    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Oct 11, 2005
    #4
  5. Guest

    A. Sinan Unur writes:
    > I wrote:


    > Upon reading this, I went ahead and added you to my
    > killfile. However, I had already started commenting
    > on your code, so, here they are:


    I was expecting no commentary on my post, but thank
    you anyway. Though our meeting was brief, I will
    always have the images of you peering with
    lugubriously feigned interest at the boilerplated
    buttocks of my code.

    >> -----%<----%<-----%<----searchBooks.pl-----%<----
    >> #!perl
    >> #
    >> use warnings;
    >> use strict;
    >> use File::Glob ':glob';
    >> use Search::VectorSpace;
    >> use File::Temp qw/ tempfile tempdir /;
    >> #
    >> local $/ = undef;

    >
    >
    >This should not be here.
    >
    >
    >> my $homedir = $ENV{'USERPROFILE'} .
    >> "/My Documents/eBooks";
    >> my @files = <$homedir/*>;
    >> @files = grep -f, @files;
    >> my @docs;
    >> for ( 0 .. $#files ) {
    >> open my $fh, "$files[$_]"

    >
    >
    >Useless use of quotes.
    >
    >
    >> or die "cannot open file $files[$_]: $!";

    >
    >
    > local $\;
    >
    >
    >> $docs[$_] = <$fh>;
    >> }


    Code changed as per your suggestions.

    Bye
     
    , Oct 11, 2005
    #5
