Vector Space Search Engine

babydoe

These are notes to myself, and to anyone else
having trouble with the article at
'http://www.perl.com/pub/a/2003/02/19/engine.html'.

No mass public feels at ease with electronic privacy.
But marketing is very much at ease invading our
privacy, and marketing has no particular concern with
truth. So we ought to be using privacy to address
lovers, postmen, children and pets. We are not, so far.

Privacy concerns with Google Desktop, which is just not
de rigueur, made me look for a replacement: I found
Perl, and in particular the Perl distribution from
ActiveState, 'http://www.activestate.com'.

With Perl installed, you can roll your own search
engine, and unlike Mr Creepy Fuckin Google's search
engine, this engine does not go online to index
anything. It does exactly what it should do (or what it
would do if it worked, because like everything with
Perl, things almost work, but not quite).

You will need extra Perl modules for your search
engine: 'Lingua-Stem', which you can install from the
ActiveState central repository by running the command
'c:\>ppm i Lingua-Stem', and also the PDL win32
binaries, named 'PDL-2.4.1-win32-4.zip'. Links to the
binaries are available from 'http://pdl.perl.org/'.
Unzip these files and run the batch file
'install-pdl.bat'.

Download the sample code, 'Listing 1, VectorSpace.pm', from
http://www.perl.com/2003/02/19/examples/VectorSpace.pm
and install it in the directory 'c:/Perl/Lib/Search'. The
'VectorSpace.pm' does not work because of the way Perl
handles record separators. You need to comment out the
subroutine 'load_stop_list' in 'VectorSpace.pm', and
replace it with the following subroutine.

--%<-----%<----first patch for VectorSpace.pm-----%<--
=item load_stop_list

Hacked by me, because, as written, with record separator
$\ = undef, the entire stop list was slurped up into
one key. Now the hash performs as it should, with each
stop_word being a separate record

=cut

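sub load_stop_list {
    # Slurp the whole DATA section whatever the caller's $/ is set to,
    # then split on whitespace so each stop word becomes its own key.
    my $data = do { local $/; <DATA> };
    my %stop_words;
    $stop_words{$_}++ for split ' ', $data;
    return \%stop_words;
}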
--%<----%<-----%<----%<-----%<----%<-----%<----%<------
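The symptom described in the patch comment can be reproduced in isolation. This is only a sketch: it uses an in-memory filehandle in place of the module's DATA section, and the three sample words and the helper name `read_words` are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's __DATA__ section.
my $stop_list = "a\nan\nthe\n";

# Read every record from an in-memory handle under a given input
# record separator ($/), chomp each one, and return them as hash keys.
sub read_words {
    my ($separator) = @_;
    local $/ = $separator;              # change is confined to this sub
    open my $in, '<', \$stop_list or die "cannot open in-memory file: $!";
    chomp( my @records = <$in> );
    close $in;
    my %words;
    $words{$_}++ for @records;
    return \%words;
}

my $line_mode  = read_words("\n");      # default: one record per line
my $slurp_mode = read_words(undef);     # slurp: whole list is one record

printf "line mode:  %d keys\n", scalar keys %$line_mode;
printf "slurp mode: %d keys\n", scalar keys %$slurp_mode;
```

With the default separator each word becomes its own key. With `$/` undefined the whole list arrives as a single record and so ends up as a single key.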

And there is one other error (blatant thank goodness),
which will give the following warning, "Use of
uninitialized value in subroutine entry at
c:/perl/lib//Search/VectorSpace.pm line 175, <DATA>
chunk 1." You need to hack into 'VectorSpace.pm' and
search for the line:

@lookup{@sorted_words} = (1..$#sorted_words );

Funny, but to me, that error sticks out like a
foreskin at a Jewish wedding; of course, replace it
with the line:

@lookup{@sorted_words} = (0..$#sorted_words );
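To see why the range matters, here is a small sketch of the same hash-slice assignment with a made-up three-word vocabulary. The variable `%buggy` is hypothetical and only there for comparison. With the `1..` range the right-hand list is one element short, so the last word gets an undefined index, which is presumably what the warning quoted above is complaining about.

```perl
use strict;
use warnings;

my @sorted_words = sort qw(taxi driver word);    # made-up vocabulary

# Buggy form: the list on the right has one element fewer than the
# slice on the left, so the last word is assigned undef.
my %buggy;
@buggy{@sorted_words} = ( 1 .. $#sorted_words );

# Corrected form: indices 0 .. $#sorted_words, one per word.
my %lookup;
@lookup{@sorted_words} = ( 0 .. $#sorted_words );

for my $word (@sorted_words) {
    printf "%-6s buggy=%-5s fixed=%d\n",
        $word,
        defined $buggy{$word} ? $buggy{$word} : 'undef',
        $lookup{$word};
}
```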

The article at 'http://www.perl.com/pub/a/2003/02/19/engine.html' is
incomplete, in that it never explains how to run the engine.
(Why are scientists like that? always leaving it to the
candy man, in his bow chicka bow bow purple velvet pimp
suit and hat, to make a practical application.)

For my practical application, I have been given an
orphaned quote, "The taxi moves off slowly, the man
still not having said a word to the driver," and I
want to find which document, in my 'eBooks' directory,
the quote came from. With the 'VectorSpace' Perl module,
now I can.
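For intuition about how a "closest match" can be scored, here is a rough sketch of cosine similarity over plain word counts. It is not the module's implementation (the article's code uses PDL, stemming and a stop list); the helper names `term_counts`, `norm` and `cosine` and the two sample documents are made up for illustration.

```perl
use strict;
use warnings;

# Count lower-cased words in a text (no stemming or stop list here).
sub term_counts {
    my ($text) = @_;
    my %count;
    $count{ lc $1 }++ while $text =~ /(\w+)/g;
    return \%count;
}

# Euclidean length of a term-count vector.
sub norm {
    my ($v) = @_;
    my $sum = 0;
    $sum += $_**2 for values %$v;
    return sqrt $sum;
}

# Cosine of the angle between two term-count vectors: 1 means the same
# word proportions, 0 means no words in common.
sub cosine {
    my ( $x, $y ) = @_;
    my $dot = 0;
    for my $term ( keys %$x ) {
        $dot += $x->{$term} * $y->{$term} if exists $y->{$term};
    }
    my $scale = norm($x) * norm($y);
    return $scale ? $dot / $scale : 0;
}

my $query = term_counts('the taxi moves off slowly');
my %docs  = (
    'doc about a taxi'   => 'The taxi moves off into the night.',
    'doc about a garden' => 'Roses grow slowly in the garden.',
);
for my $name ( sort keys %docs ) {
    printf "%-20s %.3f\n", $name, cosine( $query, term_counts( $docs{$name} ) );
}
```

The document sharing more of the query's words scores higher; a threshold such as the 0.04 used in the script further down would then discard the weakest matches.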

I type the sentence into a file 'Quote.txt' and save it
to my desktop.

-----%<----%<-----%<----Quote.txt-----%<----%<----%<--
The taxi moves off slowly, the man still not having said
a word to the driver.
-----%<----%<-----%<----%<-----%<----%<-----%<----%<--

I also save the following script to my desktop;
Arrrrrrrrgh, even to me, who wrote this script, it is a
mess, with Perl magic everywhere. The theory is simple;
I am drilling into my eBooks directory and finding the
closest match to the words in 'Quote.txt.'

-----%<----%<-----%<----searchBooks.pl-----%<----%<---
#!perl
#
use warnings;
use strict;
use File::Glob ':glob';
use Search::VectorSpace;
use File::Temp qw/ tempfile tempdir /;
#
local $/ = undef;
my $homedir = $ENV{'USERPROFILE'}."/My Documents/eBooks";
my @files = <$homedir/*>;
@files = grep -f, @files;
my @docs;
for ( 0 .. $#files ) {
open my $fh, "$files[$_]"
or die "cannot open file $files[$_]: $!";
$docs[$_] = <$fh>;
}
#
my $engine = Search::VectorSpace->new( docs => \@docs,
threshold => 0.04 );
$engine->build_index();
#
my $query = <>;
my %results = $engine->search($query);
my ($fh, $filename) = tempfile(SUFFIX => '.html');
foreach my $result (
sort { $results{$b} <=> $results{$a} }
keys %results
)
{
print "Relevance: ", $results{$result}, "\n";
print $fh $result, "\n\n"; close $fh;
exec $filename;
}
-----%<----%<-----%<----%<-----%<----%<-----%<----%<--
From the command line I type

c:\Documents and Settings\Nomen Nescio\Desktop>
perl searchBooks.pl Quote.txt

And Ta-Dum, displayed in my Internet browser is the
ebook "The Story of O, by Pauline Reage," an ebook that,
incidentally, I accidentally downloaded from the #ebooks
channel on IRC. (Note to self: better put encryption on
that book.)
 
A. Sinan Unur

(e-mail address removed) wrote in @g44g2000cwa.googlegroups.com:
With Perl installed, you can roll your own search
engine, and unlike Mr Creepy Fuckin Google's search

Upon reading this, I went ahead and added you to my killfile. However, I
had already started commenting on your code, so, here they are:
The 'VectorSpace.pm' does not work because of the way Perl
handles record separators. You need to comment out the
subroutine 'load_stop_list' in 'VectorSpace.pm', and
replace with the following subroutine.

This is misleading.
--%<-----%<----first patch for VectorSpace.pm-----%<--
=item load_stop_list

Hacked by me, because, as written, with record separator
$\ = undef, the entire stop list was slurped up into
one key.

The problem stems from you slapping a

local $/;

at the top of your program. (You also set it to undef, indicating you do
not understand how local works).

You should restrict changes from default behavior to the smallest
possible scope.
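One way to keep the change in the smallest possible scope is to move the slurp into a helper of its own. This is a sketch only, and the name `slurp_file` is made up here rather than taken from the script or the module.

```perl
use strict;
use warnings;

# Read a whole file into one string. The change to $/ is confined to
# this sub, so code elsewhere (such as a module reading its own
# __DATA__ section) still sees the default line-by-line behavior.
sub slurp_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open file $path: $!";
    local $/;                           # undef inside this scope only
    my $content = <$fh>;
    close $fh;
    return $content;
}

# Usage with the script's own file list would look like:
#   my @docs = map { slurp_file($_) } @files;
```

Because `local` restores the previous value when the sub returns, nothing else in the program needs to know the file was slurped.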

[ more drivel laced with profanity snipped ]
-----%<----%<-----%<----searchBooks.pl-----%<----%<---
#!perl
#
use warnings;
use strict;
use File::Glob ':glob';
use Search::VectorSpace;
use File::Temp qw/ tempfile tempdir /;
#
local $/ = undef;

This should not be here.
my $homedir = $ENV{'USERPROFILE'}."/My Documents/eBooks";
my @files = <$homedir/*>;
@files = grep -f, @files;
my @docs;
for ( 0 .. $#files ) {
open my $fh, "$files[$_]"

Useless use of quotes.
or die "cannot open file $files[$_]: $!";

local $/;
$docs[$_] = <$fh>;
}

You should put the "local $/;" in the body of the for loop, as above.

....
my ($fh, $filename) = tempfile(SUFFIX => '.html');
....

print "Relevance: ", $results{$result}, "\n";
print $fh $result, "\n\n"; close $fh;

You are writing plain text to an html file. Newlines won't help you
display it the way you seem to want.
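If the goal is to view the text in a browser, one possible approach is to escape it and wrap it in a `<pre>` element so the line breaks survive. This is a minimal sketch with made-up helper names (`escape_html`, `text_to_html`), not something taken from the original script.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML so that book text
# cannot be mistaken for markup.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Wrap plain text in a minimal page; <pre> keeps the original line breaks.
sub text_to_html {
    my ($text) = @_;
    return "<html><body><pre>\n" . escape_html($text) . "\n</pre></body></html>\n";
}

print text_to_html("Line one\nLine <two> & more\n");
```

Writing the result to a plain `.txt` temporary file instead would avoid the issue altogether.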

Bye.

Sinan
 
Tad McClellan

Mr Creepy Fuckin Google's search
engine,


This is a family newsgroup.

Please attempt to develop a richer vocabulary so you won't have
to resort to vulgarity as a placeholder for something meaningful.

Hacked by me, because, as written, with record separator
$\ = undef, the entire stop list was slurped up into
one key.


I seriously doubt that the *output* record separator has
an effect on *input* ...
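A quick way to check the distinction is the sketch below, which uses in-memory filehandles and made-up sample data: `$\` is appended to whatever `print` writes, while reading is still governed by `$/`.

```perl
use strict;
use warnings;

my $input  = "alpha\nbeta\n";
my $output = '';

my @lines;
{
    # $\ is appended to every print; it has no effect on reading.
    local $\ = "!\n";
    open my $in, '<', \$input or die "cannot open in-memory file: $!";
    @lines = <$in>;                     # still split on $/ (newline)
    close $in;

    open my $out, '>', \$output or die "cannot open in-memory file: $!";
    print {$out} 'done';                # $\ is added after 'done'
    close $out;
}

printf "records read: %d\n", scalar @lines;
print  "text written: $output";
```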

And there is one other error (blatant thank goodness),
which will give the following warning, "Use of
uninitialized value


That is not an error message. It is a warning message.

open my $fh, "$files[$_]"


perldoc -q vars
 
babydoe

A. Sinan Unur said:
Upon reading this, I went ahead and added you to my
killfile. However, I had already started commenting
on your code, so, here they are:

I was expecting no commentary on my post, but thank
you anyway. Though our meeting was brief, I will
always have the images of you peering with
lugubriously feigned interest at the boilerplated
buttocks of my code.
-----%<----%<-----%<----searchBooks.pl-----%<----
#!perl
#
use warnings;
use strict;
use File::Glob ':glob';
use Search::VectorSpace;
use File::Temp qw/ tempfile tempdir /;
#
local $/ = undef;


This should not be here.

my $homedir = $ENV{'USERPROFILE'} .
"/My Documents/eBooks";
my @files = <$homedir/*>;
@files = grep -f, @files;
my @docs;
for ( 0 .. $#files ) {
open my $fh, "$files[$_]"


Useless use of quotes.

or die "cannot open file $files[$_]: $!";


local $/;

$docs[$_] = <$fh>;
}

Code changed as per your suggestions.

Bye
 
