How to avoid searching this folder?

Discussion in 'Perl Misc' started by geoff@invalid.invalid, Mar 25, 2011.

  1. Guest

    Hello

    I am using Tom Boutell's simple search engine on my website but would
    like it to not index the files in a particular folder called archives.

    How would I modify the code for this? I have tried and so far failed.

    Thanks

    Geoff

    #!/usr/bin/perl

    $path = "/path/public_html";
    $webpath = "";
    $indexname = "/path/formmail/searchindex.txt";

    $nextFd = 0;

    open(OUT, ">$indexname");

    &update($path, $webpath);

    sub update {
        my($path, $webpath) = @_;
        my($dd) = $nextFd++;
        print "Updating in $path\n";
        if (!opendir($dd, $path)) {
            print STDERR "Warning: can't open $path\n";
            return;
        }
        while ($entry = readdir($dd)) {

            if ($entry =~ /^\.$/) {
                next;
            }

            if ($entry =~ /^\.\.$/) {
                next;
            }
            if (-d "$path/$entry") {
                &update("$path/$entry", "$webpath/$entry");
                next;
            }
            if (($entry !~ /.html$/i) && ($entry !~ /.htm$/i)) {
                next;
            }
            my($fd) = $nextFd++;
            if (!open($fd, "$path/$entry")) {
                print STDERR "Warning: can't open $path/$entry\n";
                next;
            }
            my(%words) = ( );
            my($line);
            while ($line = <$fd>) {
                # Support for turning off the search engine
                # indexer for parts of a page. These markers
                # must have a line to themselves. 3/13/00
                if ($line =~ /<\!\-\- SEARCH-ENGINE-OFF -->/) {
                    while ($line = <$fd>) {
                        if ($line =~ /<\!\-\- SEARCH-ENGINE-ON -->/) {
                            last;
                        }
                    }
                    next;
                }
                # Simple HTML flusher
                $line =~ s/\<.*?\>//g;
                # Case insensitive
                $line =~ tr/A-Z/a-z/;
                # If it's not a letter, it's whitespace
                $line =~ s/[^a-z]/ /g;
                my(@words) = split(/\s+/, $line);
                my($p);
                for $p (@words) {
                    if (length($p)) {
                        $words{$p}++;
                    }
                }
            }
            print OUT "$webpath/$entry ";
            my($first) = 1;
            while (($key, $val) = each(%words)) {
                print OUT "$val:$key";
                if ($first) {
                    $first = 0;
                } else {
                    print OUT " ";
                }
            }
            print OUT "\n";
            close($fd);
        }
        closedir($dd);
    }
    close(OUT);
    , Mar 25, 2011
    #1

  2. On 25.3.2011 geoff wrote:

    > Hello
    >
    > I am using Tom Boutell's simple search engine on my website but would
    > like it to not index the files in a particular folder called archives.
    >
    > How would I modify the code for this? I have tried and so far failed.


    You're usually expected to explain/show what you've tried so far ...

    I'd put in a check immediately at the beginning of the update sub.
    Unless ... you tried that and it didn't work!

    Josef
    --
    These are my personal views and not those of Fujitsu Technology Solutions!
    Josef Möllers (penguin keeper at FTS)
    If failure had no penalty success would not be a prize (T. Pratchett)
    Company Details: http://de.ts.fujitsu.com/imprint.html
    Josef Moellers, Mar 25, 2011
    #2

  3. Guest

    On Fri, 25 Mar 2011 11:17:13 +0100, Josef Moellers
    <> wrote:

    >On 25.3.2011 geoff wrote:
    >
    >> Hello
    >>
    >> I am using Tom Boutell's simple search engine on my website but would
    >> like it to not index the files in a particular folder called archives.
    >>
    >> How would I modify the code for this? I have tried and so far failed.

    >
    >You're usually expected to explain/show what you've tried so far ...


    Josef,

    I tried

    if (!-d "$path/archives") {
    next;
    }

    thinking that the code would carry on as long as the archives folder is
    not found.

    Can !-d be used like this? It seems not!

    Cheers

    Geoff

    >
    >I'd put in a check immediately at the beginning of the update sub.
    >Unless ... you tried that and it didn't work!
    >
    >Josef
    , Mar 25, 2011
    #3
  4. Guest

    On Fri, 25 Mar 2011 11:17:13 +0100, Josef Moellers
    <> wrote:

    >On 25.3.2011 geoff wrote:
    >
    >> Hello
    >>
    >> I am using Tom Boutell's simple search engine on my website but would
    >> like it to not index the files in a particular folder called archives.
    >>
    >> How would I modify the code for this? I have tried and so far failed.

    >
    >You're usually expected to explain/show what you've tried so far ...


    Josef,

    I have tried this too

    #if (-d "$path/$entry") {
    #&update("$path/$entry", "$webpath/$entry");
    #next;
    #}

    if (($entry != "archives") && (-d "$path/$entry")) {
    &update("$path/$entry", "$webpath/$entry");
    next;
    }

    thinking that so long as the $entry is not a directory called archives
    the code will continue ... still no go.

    Geoff


    >
    >I'd put in a check immediately at the beginning of the update sub.
    >Unless ... you tried that and it didn't work!
    >
    >Josef
    , Mar 25, 2011
    #4
  5. Guest

    On Fri, 25 Mar 2011 07:51:44 -0500, Tad McClellan
    <> wrote:

    > <> wrote:
    >
    >> if (($entry != "archives") && (-d "$path/$entry")) {

    > ^^^^^^^^^^^^^^^^^^^^
    >
    >You are comparing them as numbers, so each string is converted
    >into a number before the comparison is tested.


    Thanks, Tad. I changed it to ne, but it still produces an empty file.

    I have tried

    unless (($entry !~ /members/i) && (-d "$path/$entry")) {
    &update("$path/$entry", "$webpath/$entry");
    next;
    }

    but still no go! I guess my logic is wrong?
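
    One way to check the logic, as a rough sketch: `unless (A && B)` runs its body whenever `A && B` is false, which is the same as `if (!A || !B)`. The helper `enters_branch` and the sample names below are made up for illustration, and the directory test is passed in as a flag so that no real files are needed.

```perl
use strict;
use warnings;

# Hypothetical helper: mirrors
#   unless (($entry !~ /members/i) && (-d "$path/$entry"))
# with the -d result supplied as $is_dir.
sub enters_branch {
    my ($entry, $is_dir) = @_;
    return !( ($entry !~ /members/i) && $is_dir ) ? 1 : 0;
}

for my $case ( [ 'index.html', 0 ], [ 'photos', 1 ], [ 'members', 1 ] ) {
    my ($entry, $is_dir) = @$case;
    printf "%-12s dir=%d enters=%d\n",
        $entry, $is_dir, enters_branch($entry, $is_dir);
}
```

    Under these assumptions every plain file enters the branch, so it is handed to update() and then skipped by next, while an ordinary directory such as photos is not recursed into at all. That would be consistent with an empty index file.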

    Cheers

    Geoff


    >
    >You end up testing
    >
    > 0 != 0
    >
    >which will *always* be false.
    >
    >The "ne" operator is for comparing strings.
    >
    >The "!=" operator is for comparing numbers.
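
    A small sketch of the difference Tad describes. The sub names and the sample directory name `photos` are illustrative only; the numeric warning is switched off inside the demo sub so the output stays clean.

```perl
use strict;
use warnings;

# Numeric comparison: a non-numeric string is converted to 0 first.
sub differs_numeric {
    my ($x, $y) = @_;
    no warnings 'numeric';    # silence "isn't numeric" for this demo
    return $x != $y ? 1 : 0;
}

# String comparison: compares the characters themselves.
sub differs_string {
    my ($x, $y) = @_;
    return $x ne $y ? 1 : 0;
}

print differs_numeric('photos', 'archives'), "\n";   # 0, since 0 != 0 is false
print differs_string('photos', 'archives'),  "\n";   # 1
```

    With `use warnings` enabled and the `no warnings` line removed, Perl reports the non-numeric arguments, which is one reason to turn warnings on while debugging.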
    , Mar 25, 2011
    #5
  6. On 25.3.2011 geoff wrote:

    > On Fri, 25 Mar 2011 11:17:13 +0100, Josef Moellers
    > <> wrote:
    >
    >> On 25.3.2011 geoff wrote:
    >>
    >>> Hello
    >>>
    >>> I am using Tom Boutell's simple search engine on my website but would
    >>> like it to not index the files in a particular folder called archives.
    >>>
    >>> How would I modify the code for this? I have tried and so far failed.

    >>
    >> You're usually expected to explain/show what you've tried so far ...

    >
    > Josef
    >
    > I have tried this too
    >
    > #if (-d "$path/$entry") {
    > #&update("$path/$entry", "$webpath/$entry");
    > #next;
    > #}
    >
    > if (($entry != "archives") && (-d "$path/$entry")) {
    > &update("$path/$entry", "$webpath/$entry");
    > next;
    > }
    >
    > thinking that so long as the $entry is not a directory called archives
    > the code will continue ... still no go.



    Tad commented about the "!=" vs "ne".

    What about case? Maybe the name isn't spelled "archives" but "ARCHIVES"?
    Try "if ((lc($entry) ne "archives") && (-d "$path/$entry"))"

    Also: have a look at File::Find. You may be re-inventing the wheel.

    NB you can drop the "&" in front of the function call.
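
    For the File::Find suggestion, a minimal sketch of how the walk might look. The helper name `html_files` is hypothetical and this is not Boutell's code; it only lists the files that would be indexed. Setting `$File::Find::prune` inside the callback tells find() not to descend into the current directory.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return the .htm/.html files under $root,
# skipping any directory whose name is exactly "archives".
sub html_files {
    my ($root) = @_;
    my @found;
    find(
        sub {
            if ( -d $_ && $_ eq 'archives' ) {
                $File::Find::prune = 1;    # do not descend into it
                return;
            }
            push @found, $File::Find::name if -f $_ && /\.html?\z/i;
        },
        $root
    );
    my @sorted = sort @found;
    return @sorted;
}

# Run as: perl script.pl /path/public_html
if (@ARGV) {
    print "$_\n" for html_files( $ARGV[0] );
}
```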
    Josef

    --
    These are my personal views and not those of Fujitsu Technology Solutions!
    Josef Möllers (Pinguinpfleger bei FTS)
    If failure had no penalty success would not be a prize (T. Pratchett)
    Company Details: http://de.ts.fujitsu.com/imprint.html
    Josef Moellers, Mar 25, 2011
    #6
  7. Jim Gibson Guest

    In article <>,
    <> wrote:

    > On Fri, 25 Mar 2011 11:17:13 +0100, Josef Moellers
    > <> wrote:
    >
    > I have tried this too
    >
    > if (($entry != "archives") && (-d "$path/$entry")) {
    > &update("$path/$entry", "$webpath/$entry");
    > next;
    > }
    >
    > thinking that so long as the $entry is not a directory called archives
    > the code will continue ... still no go.


    Try this:

    if ( -d "$path/$entry" ) {
    next if $entry =~ /archive/i;
    update("$path/$entry", "$webpath/$entry");
    next;
    }
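
    Note that `/archive/i` is a substring match, so it also skips directories such as archive-photos. A quick comparison with an exact-name test follows; the sub names and sample entries are illustrative, and the exact-match variant is an assumed alternative rather than part of the reply.

```perl
use strict;
use warnings;

# Substring, case-insensitive: the test used in the snippet above.
sub skip_loose { my ($entry) = @_; return $entry =~ /archive/i ? 1 : 0 }

# Exact name, case-insensitive: an assumed stricter alternative.
sub skip_exact { my ($entry) = @_; return lc($entry) eq 'archives' ? 1 : 0 }

for my $entry (qw(archives ARCHIVES archive-photos news)) {
    printf "%-15s loose=%d exact=%d\n",
        $entry, skip_loose($entry), skip_exact($entry);
}
```

    Which of the two is wanted depends on whether other folders with "archive" in their names should be indexed.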

    --
    Jim Gibson
    Jim Gibson, Mar 25, 2011
    #7
  8. Guest

    On Fri, 25 Mar 2011 15:11:58 +0100, Josef Moellers
    <> wrote:

    >Tad commented about the "!=" vs "ne".
    >
    >What about case? Maybe the name isn't spelled "archives" but "ARCHIVES"?
    >Try "if ((lc($entry) ne "archives") && (-d "$path/$entry"))"


    Josef,

    I tried the above but the code does not work - it produces an empty
    searchindex.txt file.

    Maybe I have got the wrong approach - could you please look at this
    part of the code again and suggest how it should be changed to avoid
    indexing the archives folder?


    sub update {
    my($path, $webpath) = @_;
    my($dd) = $nextFd++;
    print "Updating in $path\n";
    if (!opendir($dd, $path)) {
    print STDERR "Warning: can't open $path\n";
    return;
    }
    while ($entry = readdir($dd)) {
    if ($entry =~ /^\.$/) {
    next;
    }
    if ($entry =~ /^\.\.$/) {
    next;
    }
    if (-d "$path/$entry") {
    &update("$path/$entry", "$webpath/$entry");
    next;
    }
    if (($entry !~ /.html$/i) && ($entry !~ /.htm$/i)) {
    next;
    }
    my($fd) = $nextFd++;
    if (!open($fd, "$path/$entry")) {
    print STDERR "Warning: can't open $path/$entry\n";
    next;
    }
    my(%words) = ( );
    my($line);
    while ($line = <$fd>) {


    >
    >Also: have a look at File::Find. You may be re-inventing the wheel.


    I don't think I would dare do this at the moment!

    Cheers

    Geoff

    >
    >NB you can drop the "&" in front of the function call.
    >Josef
    , Mar 25, 2011
    #8
  9. Guest

    On Fri, 25 Mar 2011 15:11:58 +0100, Josef Moellers
    <> wrote:

    Josef,

    I seem to have got this now!

    if ($entry ne "archives" ) {
    if (-d "$path/$entry") {
    &update("$path/$entry", "$webpath/$entry");
    next;
    }
    }

    The above now avoids the archives folder.
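
    In this version an entry named archives skips only the recursion and then falls through to the .html/.htm filter, which rejects it because the name has no such extension. A sketch that makes the skip explicit is below; `collect` is a hypothetical stand-in for update() that gathers file paths instead of writing the index.

```perl
use strict;
use warnings;

# Hypothetical re-statement of the directory walk: collect .htm/.html
# paths under $path, never descending into a directory named "archives".
sub collect {
    my ($path, $found) = @_;
    opendir my $dh, $path or do {
        warn "Warning: can't open '$path': $!\n";
        return;
    };
    my @entries = grep { !/\A\.\.?\z/ } readdir $dh;
    closedir $dh;
    for my $entry (sort @entries) {
        my $full = "$path/$entry";
        if ( -d $full ) {
            next if $entry eq 'archives';    # skip the excluded folder outright
            collect( $full, $found );
            next;
        }
        push @$found, $full if $entry =~ /\.html?\z/i;
    }
    return;
}

# Run as: perl script.pl /path/public_html
if (@ARGV) {
    my @found;
    collect( $ARGV[0], \@found );
    print "$_\n" for @found;
}
```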

    Cheers

    Geoff
    , Mar 25, 2011
    #9
  10. Guest

    On Fri, 25 Mar 2011 08:36:43 -0700, Jim Gibson <>
    wrote:

    >In article <>,
    ><> wrote:
    >
    >> On Fri, 25 Mar 2011 11:17:13 +0100, Josef Moellers
    >> <> wrote:
    >>
    >> I have tried this too
    >>
    >> if (($entry != "archives") && (-d "$path/$entry")) {
    >> &update("$path/$entry", "$webpath/$entry");
    >> next;
    >> }
    >>
    >> thinking that so long as the $entry is not a directory called archives
    >> the code will continue ... still no go.

    >
    >Try this:
    >
    > if ( -d "$path/$entry" ) {
    > next if $entry =~ /archive/i;
    > update("$path/$entry", "$webpath/$entry");
    > next;
    > }



    Thanks, Jim. Yours is much neater than mine:

    if ($entry ne "archives" ) {
        if (-d "$path/$entry") {
            &update("$path/$entry", "$webpath/$entry");
            next;
        }
    }

    Cheers

    Geoff
    , Mar 25, 2011
    #10
  11. geoff@invalid.invalid wrote:
    > Hello


    Hello,

    > I am using Tom Boutell's simple search engine on my website but would
    > like it to not index the files in a particular folder called archives.
    >
    > How would I modify the code for this? I have tried and so far failed.
    >
    > Thanks
    >
    > Geoff
    >
    > #!/usr/bin/perl


    The next two lines should be:

    use warnings;
    use strict;


    > $path = "/path/public_html";
    > $webpath = "";
    > $indexname = "/path/formmail/searchindex.txt";


    my $path = "/path/public_html";
    my $webpath = "";
    my $indexname = "/path/formmail/searchindex.txt";


    > $nextFd = 0;


    It looks like you don't really need this variable, so what is it really
    supposed to do for your program?


    > open(OUT, ">$indexname");


    You should *always* verify that the file was opened correctly before
    trying to use what may be an invalid filehandle:

    open OUT, '>', $indexname or die "Cannot open '$indexname' because: $!";
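
    A small sketch of the same check wrapped in a sub, using a lexical filehandle instead of the bareword OUT. The helper name `open_for_write` and the non-existent path are assumptions made for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: open $file for writing or die with the OS error.
sub open_for_write {
    my ($file) = @_;
    open my $out, '>', $file or die "Cannot open '$file' because: $!";
    return $out;
}

# Assumes /no/such/dir does not exist, so this open fails.
my $ok = eval { open_for_write('/no/such/dir/searchindex.txt'); 1 };
print "open failed: $@" unless $ok;
```

    Without the check, a failed open goes unnoticed and every later print to the handle writes nothing.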


    > &update($path, $webpath);


    In modern versions of Perl you don't need to use ampersands on
    subroutine calls:

    update($path, $webpath);


    > sub update {
    > my($path, $webpath) = @_;
    > my($dd) = $nextFd++;


    Why are you storing a number in a variable that you are going to use for
    a directory handle? That makes no sense.


    > print "Updating in $path\n";
    > if (!opendir($dd, $path)) {
    > print STDERR "Warning: can't open $path\n";
    > return;
    > }


    You should declare variables where you first use them and you should
    include $! in the error message so you know why it failed:

    opendir my $dd, $path or do {
    warn "Warning: can't open '$path' because: $!";
    return;
    };


    > while ($entry = readdir($dd)) {


    while ( my $entry = readdir $dd ) {


    > if ($entry =~ /^\.$/) {
    > next;
    > }
    >
    > if ($entry =~ /^\.\.$/) {
    > next;
    > }


    Or simply:

    next if $entry =~ /\A\.\.?\z/;


    > if (-d "$path/$entry") {
    > &update("$path/$entry", "$webpath/$entry");
    > next;
    > }
    > if (($entry !~ /.html$/i)&& ($entry !~ /.htm$/i)) {
    > next;
    > }


    You have to escape the period or it will match any character and you can
    combine both regular expressions into one (same as example above):

    next if $entry !~ /\.html?$/i;
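
    To see the effect of the unescaped period, here is a sketch comparing the two tests. The sub names and file names are made up for the example.

```perl
use strict;
use warnings;

# Original tests: "." matches any character.
sub is_html_loose {
    my ($name) = @_;
    return ( $name =~ /.html$/i || $name =~ /.htm$/i ) ? 1 : 0;
}

# Escaped period, both extensions in one pattern.
sub is_html_strict {
    my ($name) = @_;
    return $name =~ /\.html?$/i ? 1 : 0;
}

for my $name (qw(index.html page.HTM notes_html README)) {
    printf "%-12s loose=%d strict=%d\n",
        $name, is_html_loose($name), is_html_strict($name);
}
```

    Only notes_html differs between the two: the loose version accepts it because "_" satisfies the unescaped period.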


    > my($fd) = $nextFd++;


    Why are you storing a number in a variable that you are going to use for
    a filehandle? That makes no sense.


    > if (!open($fd, "$path/$entry")) {
    > print STDERR "Warning: can't open $path/$entry\n";
    > next;
    > }


    You should declare variables where you first use them and you should
    include $! in the error message so you know why it failed:

    open my $fd, '<', "$path/$entry" or do {
    warn "Warning: can't open '$path/$entry' because: $!";
    next;
    };


    > my(%words) = ( );


    Or just:

    my %words;


    > my($line);
    > while ($line =<$fd>) {


    Or just:

    while ( my $line = <$fd> ) {


    > # Support for turning off the search engine
    > # indexer for parts of a page. These markers
    > # must have a line to themselves. 3/13/00
    > if ($line =~ /<\!\-\- SEARCH-ENGINE-OFF -->/)
    > {
    > while ($line =<$fd>) {
    > if ($line =~ /<\!\-\- SEARCH-ENGINE-ON -->/) {
    > last;
    > }
    > }
    > next;
    > }
    > # Simple HTML flusher
    > $line =~ s/\<.*?\>//g;
    > # Case insensitive
    > $line =~ tr/A-Z/a-z/;
    > # If it's not a letter, it's whitespace
    > $line =~ s/[^a-z]/ /g;


    You could also use tr/// for that:

    $line =~ tr/a-z/ /c;
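
    A quick check that the two forms agree, using an arbitrary sample line. With the /c modifier, tr works on the complement of a-z, and the single space in the replacement list is reused for every such character.

```perl
use strict;
use warnings;

my $line = "hello, world! 42 times";

( my $with_s  = $line ) =~ s/[^a-z]/ /g;    # substitution form
( my $with_tr = $line ) =~ tr/a-z/ /c;      # transliteration form

print "[$with_s]\n[$with_tr]\n";
print "identical\n" if $with_s eq $with_tr;
```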


    > my(@words) = split(/\s+/, $line);


    That might be better as:

    my @words = split ' ', $line;


    > my($p);
    > for $p (@words) {


    Better as:

    for my $p ( @words ) {


    > if (length($p)) {


    Why would $p have zero length? Probably because you are using /\s+/
    instead of ' ' as the first argument to split which will give you a zero
    length string if there is leading whitespace in $line.
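
    The difference is easy to see on a line with leading whitespace. The sample string here is arbitrary.

```perl
use strict;
use warnings;

my $line = "  the quick  fox ";

my @by_regex = split /\s+/, $line;    # keeps an empty leading field
my @by_space = split ' ',   $line;    # skips leading whitespace

print scalar(@by_regex), " fields with /\\s+/\n";
print scalar(@by_space), " fields with ' '\n";
```

    In both cases trailing empty fields are dropped, so only the leading one makes the difference.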


    > $words{$p}++;
    > }
    > }
    > }
    > print OUT "$webpath/$entry ";
    > my($first) = 1;


    Why are you forcing list context on a scalar assignment?


    > while (($key, $val) = each(%words)) {


    Better as:

    while ( my ( $key, $val ) = each %words ) {


    > print OUT "$val:$key";
    > if ($first) {
    > $first = 0;
    > } else {
    > print OUT " ";
    > }


    So you want no space between the first and second "$val:$key" but a
    space after every other occurrence of "$val:$key" including at the end
    of the line?


    > }
    > print OUT "\n";


    It looks like you could probably do that while loop like this instead:

    print OUT join( ' ', map "$words{$_}:$_", keys %words ), "\n";
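
    A sketch comparing the two ways of building the line. `original_style` is a hypothetical re-statement of the $first logic that returns a string instead of printing; the sample counts are invented, and the keys are sorted here only to make the output stable, whereas the one-liner above uses the hash's own order.

```perl
use strict;
use warnings;

my %words = ( perl => 3, search => 1, index => 2 );
my @pairs = map { "$words{$_}:$_" } sort keys %words;

# Hypothetical re-statement of the original loop's spacing logic.
sub original_style {
    my @items = @_;
    my ( $out, $first ) = ( '', 1 );
    for my $item (@items) {
        $out .= $item;
        if ($first) { $first = 0 } else { $out .= ' ' }
    }
    return $out;
}

my $record = join ' ', @pairs;    # separator between items only

print "[", original_style(@pairs), "]\n";
print "[$record]\n";
```

    The first line shows the first two entries run together and a trailing space; the join version has neither.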


    > close($fd);
    > }
    > closedir($dd);
    > }
    > close(OUT);





    John
    --
    Any intelligent fool can make things bigger and
    more complex... It takes a touch of genius -
    and a lot of courage to move in the opposite
    direction. -- Albert Einstein
    John W. Krahn, Mar 26, 2011
    #11
  12. Guest

    On Fri, 25 Mar 2011 23:59:18 -0700, "John W. Krahn"
    <> wrote:


    John,

    You have made a lot of no doubt useful comments, but the code is not
    mine. It came from Tom Boutell's site, and my only concern was to be
    able to avoid indexing some particular files/folders.

    Cheers

    Geoff
    , Mar 27, 2011
    #12