babydoe
These are notes to myself, and anyone else
having trouble with the article at:
'http://www.perl.com/pub/a/2003/02/19/engine.html'.
No mass public feels at ease with electronic privacy.
But marketing is very much at ease invading our
privacy, and marketing has no particular concern with
truth. So we ought to be using privacy to address
lovers, postmen, children and pets. We are not, so far.
Privacy concerns with Google Desktop, which is just not
de rigueur, made me look for a replacement: I found
Perl, and in particular the ActiveState Perl
distribution, 'http://www.activestate.com'.
With Perl installed, you can roll your own search
engine. Unlike Mr Creepy Google's search engine, this
engine does not go online to index anything; it does
exactly what it should do (or what it should do if it
worked, because like everything with Perl, things
almost work, but not quite).
You will need extra Perl modules for your search
engine. The first is 'Lingua-Stem', which you can
install from the ActiveState central repository by
running the command 'ppm i Lingua-Stem' at the 'c:\>'
prompt. The second is the PDL win32 binaries, named
'PDL-2.4.1-win32-4.zip'; links to the binaries are
available from 'http://pdl.perl.org/'. Unzip these
files and run the batch file 'install-pdl.bat'.
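Before going further it is worth checking that both installs took. This is only a sketch: the helper name 'module_available' is my own invention, and I am assuming the two packages load under the module names 'Lingua::Stem' and 'PDL'.

```perl
use strict;
use warnings;

# Return 1 if the named module can be loaded, 0 otherwise.
# (module_available is a hypothetical helper, not part of any module.)
sub module_available {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # Lingua::Stem -> Lingua/Stem.pm
    return eval { require $file; 1 } ? 1 : 0;
}

for my $module (qw(Lingua::Stem PDL)) {
    printf "%-14s %s\n", $module,
        module_available($module) ? 'installed' : 'MISSING';
}
```

If either line says MISSING, go back and redo the matching install step before touching 'VectorSpace.pm'.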
Download the sample code, 'Listing 1, VectorSpace.pm',
from
http://www.perl.com/2003/02/19/examples/VectorSpace.pm
and install it in the directory 'c:/Perl/Lib/Search'.
The 'VectorSpace.pm' module does not work as downloaded
because of the way Perl handles the input record
separator, '$/'. You need to comment out the subroutine
'load_stop_list' in 'VectorSpace.pm' and replace it with
the following subroutine.
--%<-----%<----first patch for VectorSpace.pm-----%<--
=item load_stop_list

Hacked by me, because, as written, with the input record
separator $/ = undef, the entire stop list was slurped up
into one key. Now the hash performs as it should, with
each stop word being a separate key.

=cut

sub load_stop_list {
    # Read every record from DATA, whatever $/ is set to,
    # and split on whitespace: one entry per stop word.
    my @stop_words = map { split ' ', $_ } <DATA>;
    my %stop_words;
    $stop_words{$_}++ for @stop_words;
    return \%stop_words;
}
--%<----%<-----%<----%<-----%<----%<-----%<----%<------
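To see for yourself what the record separator does to a stop list, here is a small sketch. The four-word list and the helper 'keys_per_record' are made up for the demonstration; the real list lives in the module's DATA section.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's stop list.
my $stop_list = "a\nan\nthe\nof\n";

# Read a string through an in-memory filehandle and store
# one hash key per record, using the given record separator.
sub keys_per_record {
    my ($text, $separator) = @_;
    local $/ = $separator;          # undef means "slurp everything"
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    my %seen;
    while (my $record = <$fh>) {
        chomp $record;
        $seen{$record}++;
    }
    close $fh;
    return \%seen;
}

my $by_line = keys_per_record($stop_list, "\n");
my $slurped = keys_per_record($stop_list, undef);

print "line mode:  ", scalar(keys %$by_line), " keys\n";
print "slurp mode: ", scalar(keys %$slurped), " keys\n";
```

Line mode gives four keys, one per word. Slurp mode gives a single key holding the whole list, which is exactly the failure the patch works around.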
There is one other error (blatant, thank goodness),
which gives the following warning: "Use of
uninitialized value in subroutine entry at
c:/perl/lib//Search/VectorSpace.pm line 175, <DATA>
chunk 1." You need to open 'VectorSpace.pm' and
search for the line:
@lookup{@sorted_words} = (1..$#sorted_words );
To me that error sticks out a mile: the range starts
at 1, but it has to pair off with every element of
@sorted_words, so the last word is left with no index
at all. Replace it with the line:
@lookup{@sorted_words} = (0..$#sorted_words );
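A quick way to convince yourself, using a made-up four-word list in place of the module's real @sorted_words (the hash names %broken and %fixed are mine):

```perl
use strict;
use warnings;

# Hypothetical word list standing in for @sorted_words.
my @sorted_words = qw(driver man taxi word);

my (%broken, %fixed);
@broken{@sorted_words} = (1 .. $#sorted_words);   # three values for four keys
@fixed{@sorted_words}  = (0 .. $#sorted_words);   # one index per word

for my $word (@sorted_words) {
    printf "%-7s broken=%-5s fixed=%d\n",
        $word,
        defined $broken{$word} ? $broken{$word} : 'undef',
        $fixed{$word};
}
```

With the original range the last word, 'word', ends up with an undefined index, and that undefined value is what the warning is complaining about. With the corrected range every word gets an index from 0 to 3.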
The article at
'http://www.perl.com/pub/a/2003/02/19/engine.html' is
incomplete, in that it never explains how to run the
engine. (Why are scientists like that? Always leaving it
to the candy man, in his bow chicka bow bow purple
velvet pimp suit and hat, to make a practical
application.)
For my practical application, I have been given an
orphaned quote, "The taxi moves off slowly, the man
still not having said a word to the driver," and I
want to find which document in my 'eBooks' directory
the quote came from. With the 'VectorSpace' Perl module,
now I can.
I type the sentence into a file 'Quote.txt' and save it
to my desktop.
-----%<----%<-----%<----Quote.txt-----%<----%<----%<--
The taxi moves off slowly, the man still not having said
a word to the driver.
-----%<----%<-----%<----%<-----%<----%<-----%<----%<--
I also save the following script to my desktop.
Arrrrrrrrgh, even to me, who wrote this script, it is a
mess, with Perl magic everywhere. The theory is simple:
I am drilling into my eBooks directory and finding the
closest match to the words in 'Quote.txt'.
-----%<----%<-----%<----searchBooks.pl-----%<----%<---
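#!perl
use warnings;
use strict;
use File::Glob qw(bsd_glob);   # bsd_glob copes with the space in "My Documents"
use Search::VectorSpace;
use File::Temp qw(tempfile);

# Slurp mode: every <$fh> read below returns a whole file.
local $/ = undef;

# Read each plain file in the eBooks directory into @docs.
my $homedir = $ENV{'USERPROFILE'} . "/My Documents/eBooks";
my @files   = grep { -f $_ } bsd_glob("$homedir/*");
my @docs;
for my $file (@files) {
    open my $in, '<', $file
        or die "cannot open file $file: $!";
    push @docs, scalar <$in>;
    close $in;
}

my $engine = Search::VectorSpace->new( docs => \@docs,
                                       threshold => 0.04 );
$engine->build_index();

# The query is the whole of the file named on the command line.
my $query   = <>;
my %results = $engine->search($query);

# The keys of %results are the matching documents themselves.
my ($best) = sort { $results{$b} <=> $results{$a} } keys %results;
die "no document scored above the threshold\n" unless defined $best;
print "Relevance: ", $results{$best}, "\n";

# Write the best match to a temporary file and hand it to
# whatever Windows associates with .html files.
my ($out, $filename) = tempfile(SUFFIX => '.html');
print $out $best, "\n\n";
close $out;
exec $filename;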
-----%<----%<-----%<----%<-----%<----%<-----%<----%<--
From the command line I type
c:\Documents and Settings\Nomen Nescio\Desktop>perl searchBooks.pl Quote.txt
And Ta-Dum, displayed in my web browser is the
ebook "The Story of O, by Pauline Reage," an ebook,
incidentally, I accidentally downloaded from the #ebooks
channel on IRC. (Note to self: better put encryption on
that book.)
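For anyone wondering what "closest match" means here, the sketch below shows the general idea in plain Perl. It is not the code in 'VectorSpace.pm': I am assuming the usual vector-space recipe of turning each text into a word-count vector and comparing vectors by cosine, with no PDL, no stemming and no stop list. The two sample documents and the names 'word_counts' and 'cosine' are made up.

```perl
use strict;
use warnings;

# Count lower-cased words in a piece of text (no stemming, no stop list).
sub word_counts {
    my ($text) = @_;
    my %count;
    $count{ lc $1 }++ while $text =~ /(\w+)/g;
    return \%count;
}

# Cosine of the angle between two word-count vectors: 1 means the
# same mix of words, 0 means no words in common.
sub cosine {
    my ($x, $y) = @_;
    my ($dot, $norm_x, $norm_y) = (0, 0, 0);
    $norm_x += $_ ** 2 for values %$x;
    $norm_y += $_ ** 2 for values %$y;
    for my $word (keys %$x) {
        $dot += $x->{$word} * $y->{$word} if exists $y->{$word};
    }
    return 0 unless $norm_x && $norm_y;
    return $dot / sqrt($norm_x * $norm_y);
}

# Hypothetical two-document "library" and a query.
my %docs = (
    taxi   => 'The taxi moves off slowly and the driver says nothing.',
    garden => 'The roses in the garden bloom slowly in spring.',
);
my $query = word_counts('the man said not a word to the taxi driver');

my %score;
$score{$_} = cosine($query, word_counts($docs{$_})) for keys %docs;

for my $name (sort { $score{$b} <=> $score{$a} } keys %score) {
    printf "%-7s %.3f\n", $name, $score{$name};
}
```

The 'taxi' document comes out on top, but the 'garden' document still scores well above zero purely because both texts contain "the". That is the job of the stop list: common words are thrown away before the comparison so they cannot pad the score.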