Search script to index dynamic pages

Discussion in 'Perl Misc' started by Rob, Mar 28, 2011.

  1. Rob

    Rob Guest

    I recently tried to add a downloaded CGI script to my site, before
    realising that it would naturally only index static pages on the site
    (i.e. only files that it could open using Perl routines). I have
    altered the indexing routine so that it does not crawl all directories
    and index all files. Instead, it opens a file containing a specified
    list of files and indexes only those.

    As the majority of the content on the website is dynamic content, does
    anybody know of any search CGI scripts that will index pages with
    dynamic CGI content? (e.g. "website.com/cgi-bin/viewpage.cgi?id=100" )

    If such a script is not available, is there a way to return content
    from a dynamic page to a CGI script (for indexing purposes)? I could
    alter the indexing routine so that it does not just open the files it
    is told to, but to return the content (that the script would output to
    the server) to the indexing routine instead. This would then suit the
    needs of the website perfectly.

    Apologies if this has already been covered elsewhere; I have had no
    success in finding a solution online. Any help with this would be much
    appreciated.

    Best Regards

    Rob
    Rob, Mar 28, 2011
    #1

  2. Rob

    Keith Keller Guest

    On 2011-03-28, Rob <> wrote:
    >
    > As the majority of the content on the website is dynamic content, does
    > anybody know of any search CGI scripts that will index pages with
    > dynamic CGI content? (e.g. "website.com/cgi-bin/viewpage.cgi?id=100" )


    Not directly, but you might consider using something like KinoSearch,
    which can create an index of anything you feed to it. You'd need to
    code up the indexer yourself, but it's fairly straightforward, assuming
    you have access to the backend content you're trying to index.

    --keith

    Keith Keller, Mar 29, 2011
    #2

  3. Rob <> wrote:
    >As the majority of the content on the website is dynamic content, does
    >anybody know of any search CGI scripts that will index pages with
    >dynamic CGI content? (e.g. "website.com/cgi-bin/viewpage.cgi?id=100" )


    How would that script know which parameters are supported? Is id=100
    legal? Is id=100000000000000 legal? Is it myid=... instead of id=...?

    >If such a script is not available, is there a way to return content
    >from a dynamic page to a CGI script (for indexing purposes)?


    That is trivial, see
    perldoc -q "How do I fetch an HTML file?"
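
    For the indexing case, the FAQ approach could look roughly like the
    sketch below. It assumes the indexer is handed a list of ids known to
    be valid (which sidesteps the parameter question above), reuses the
    viewpage.cgi URL from the question, and uses two made-up helper names,
    page_urls and fetch_page. LWP::UserAgent is loaded lazily so the
    URL-building part runs even where LWP is not installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Base URL taken from the example in the question (placeholder site).
my $base = 'http://website.com/cgi-bin/viewpage.cgi';

# Build the dynamic URLs from a list of ids that are known to exist.
# Assumes plain numeric ids, so no URL escaping is done here.
sub page_urls {
    my ($base_url, @ids) = @_;
    return map { "$base_url?id=$_" } @ids;
}

# Fetch one page. Returns (content, undef) on success,
# or (undef, status line) on failure.
sub fetch_page {
    my ($url) = @_;
    require LWP::UserAgent;    # loaded only when a fetch is attempted
    my $ua       = LWP::UserAgent->new( timeout => 10 );
    my $response = $ua->get($url);
    return ( $response->decoded_content, undef ) if $response->is_success;
    return ( undef, $response->status_line );
}

# Show which URLs the indexer would request.
print "$_\n" for page_urls( $base, 100, 101, 102 );
```

    Each URL can then be passed to fetch_page, and the returned content
    handed to the existing indexing routine in place of a file's contents.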

    jue
    Jürgen Exner, Mar 29, 2011
    #3
  4. Rob

    Rob Guest

    Thank you all for your responses.

    I have tried to download and index the pages using the script with the
    LWP module, but so far without success. I have done quite a lot of
    Perl programming in the past, but the LWP module is quite new to me.

    I have written a test routine just to see if it works but this has not
    worked either (below):

    ######################################
    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple;

    $data = get("http://www.samlpesite.org");

    print $data;
    ######################################

    I just get server errors. The file permissions are correct and the LWP
    module is installed on the server. Have I missed something obvious, or
    used the 'get' routine incorrectly?

    Any help would be great.

    Regards

    Rob

    On Mar 29, 3:14 pm, Jürgen Exner <> wrote:
    > Rob <> wrote:
    > >As the majority of the content on the website is dynamic content, does
    > >anybody know of any search CGI scripts that will index pages with
    > >dynamic CGI content? (e.g. "website.com/cgi-bin/viewpage.cgi?id=100" )

    >
    > How would that script know which parameters are supported? Is id=100
    > legal? Is id=100000000000000 legal? Is it myid=... instead of id=....?
    >
    > >If such a script is not available, is there a way to return content
    > >from a dynamic page to a CGI script (for indexing purposes)?

    >
    > That is trivial, see
    >         perldoc -q "How do I fetch an HTML file?"
    >
    > jue
    Rob, Mar 30, 2011
    #4
  5. Rob

    Rob Guest

    On Mar 30, 4:53 pm, Tad McClellan <> wrote:
    > Did it occur to you that the text of the error messages
    > might be helpful in debugging the problem?
    >


    Yes, this did occur to me- however my hosting provider does not give
    error logs. The code works when run on my own machine, but does not
    work correctly when run on the server. If I pass the information from
    a 'get' command to a variable it is simply left blank.


    > After it is working from the command line, *then* run it under
    > a web server.


    I am now able to 'get' a page when I run the script from my own
    computer - it successfully downloads the page and I can do what I like
    with the data. However when I run this script online it does not work
    at all. I have tried other techniques, such as WWW::Mechanize and
    LWP::UserAgent, neither of which produce better results.

    One thought was that there may have been firewall/bot protection on
    the server, but as the script works from my own computer then it
    should work from the server also?

    Many thanks for your help so far,

    Rob
    Rob, Mar 31, 2011
    #5
  6. Rob

    Willem Guest

    Rob wrote:
    ) On Mar 30, 4:53 pm, Tad McClellan <> wrote:
    )> Did it occur to you that the text of the error messages
    )> might be helpful in debugging the problem?
    )>
    )
    ) Yes, this did occur to me- however my hosting provider does not give
    ) error logs.

    Off the top of my head:

    BEGIN {
        print "Content-type: text/plain\n\n";
        $SIG{__WARN__} = sub { print @_ };
        $SIG{__DIE__} = sub { print @_ unless $^S };
    }

    Should make the error message go to the browser,
    and that should even work for compile-time errors.


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
    Willem, Mar 31, 2011
    #6
  7. Rob

    Rob Guest

    On Mar 31, 8:22 pm, Willem <> wrote:

    > Off the top of my head:
    >
    > BEGIN {
    >         print "Content-type: text/plain\n\n";
    >         $SIG{__WARN__} = sub { print @_ };
    >         $SIG{__DIE__} = sub { print @_ unless $^S };
    >
    > }
    >
    > Should make the error message go to the browser,
    > and that should even work for compile-time errors.


    Thank you for this. When I run with this code the only error I get is
    that the variable '$data' is uninitialized (presumably because the
    'get' function has not succeeded in passing any data to it).

    The code now stands as follows (for the test file I have made):


    ################################################################################
    #!/usr/bin/perl -w
    use CGI qw(:all);
    use LWP::Simple;

    BEGIN {
    print "Content-type: text/plain\n\n";
    $SIG{__WARN__} = sub { print @_ };
    $SIG{__DIE__} = sub { print @_ unless $^S };
    }
    my $data = get("http://www.samplesite.org");

    open (CF2, "testtext.txt");
    print CF2 "$data";
    close(CF2);
    ################################################################################

    Like I said, it works on my computer but not on the server.

    Thanks,

    Rob
    Rob, Mar 31, 2011
    #7
  8. Rob

    Willem Guest

    Rob wrote:
    ) On Mar 31, 8:22 pm, Willem <> wrote:
    )
    )> Off the top of my head:
    )>
    )> BEGIN {
    )>         print "Content-type: text/plain\n\n";
    )>         $SIG{__WARN__} = sub { print @_ };
    )>         $SIG{__DIE__} = sub { print @_ unless $^S };
    )>
    )> }
    )>
    )> Should make the error message go to the browser,
    )> and that should even work for compile-time errors.
    )
    ) Thank you for this. When I run with this code the only error I get is
    ) that the variable '$data' is uninitialized (presumably because the
    ) 'get' function has not succeeded in passing any data to it).

    Yes. LWP::Simple doesn't do error reporting. At all.
    You should use LWP::UserAgent if you want to know more than 'it failed'.

    ) The code now stands as follows (for the test file I have made):
    )
    )
    ) ################################################################################
    ) #!/usr/bin/perl -w
    ) use CGI qw(:all);
    ) use LWP::Simple;

    use strict;
    use warnings;

    )
    ) BEGIN {
    ) print "Content-type: text/plain\n\n";
    ) $SIG{__WARN__} = sub { print @_ };
    ) $SIG{__DIE__} = sub { print @_ unless $^S };
    ) }
    ) my $data = get("http://www.samplesite.org");
    )
    ) open (CF2, "testtext.txt");
    ) print CF2 "$data";
    ) close(CF2);

    # Use lexical filehandles.

    ) ################################################################################
    )
    ) Like I said, it works on my computer but not on the server.

    Dump LWP::Simple, and code it using LWP::UserAgent

    use LWP::UserAgent;
    my $response = LWP::UserAgent->new->get("http://www.samplesite.org");
    if ($response->is_success) {
        open (my $cf, '>', 'testtext.txt') or die "Failed to write: $!";
        print $cf $response->decoded_content;
        close $cf;
    } else {
        die $response->status_line;
    }


    SaSW, Willem
    Willem, Mar 31, 2011
    #8
  9. Rob

    Rob Guest

    On Mar 31, 8:50 pm, Willem <> wrote:

    > Dump LWP::Simple, and code it using LWP::UserAgent
    >
    > use LWP::UserAgent;


    That has helped to show me that it is a matter of the request timing
    out. I imagine this may be because of a security feature which is not
    allowing a script within the site to download itself? I have tried
    changing the IP address of the UserAgent using:

    $ua->local_address("10.10.10.10");

    but I am told that the method "local_address" can't be located through
    the UserAgent package. After looking this up, it would appear that the
    version of Perl is out of date on the server.

    Rob
    Rob, Mar 31, 2011
    #9
  10. Willem wrote:
    > Rob wrote:
    > ) On Mar 30, 4:53 pm, Tad McClellan <> wrote:
    > )> Did it occur to you that the text of the error messages
    > )> might be helpful in debugging the problem?
    > )>
    > )
    > ) Yes, this did occur to me- however my hosting provider does not give
    > ) error logs.


    Holy cow. Do you pay for this hosting provider?

    >
    > Off the top of my head:
    >
    > BEGIN {
    > print "Content-type: text/plain\n\n";
    > $SIG{__WARN__} = sub { print @_ };
    > $SIG{__DIE__} = sub { print @_ unless $^S };
    > }
    >
    > Should make the error message go to the browser,
    > and that should even work for compile-time errors.


    As long as the compile-time error occurs after the BEGIN block.

    I use:

    use CGI::Carp qw(fatalsToBrowser);

    It doesn't deal with warnings, but that module provides other ways to
    do that.

    Xho
    Xho Jingleheimerschmidt, Apr 1, 2011
    #10
  11. Rob

    J. Gleixner Guest

    Rob wrote:
    [...]
    > ################################################################################
    > #!/usr/bin/perl -w
    > use CGI qw(:all);
    > use LWP::Simple;
    >
    > BEGIN {
    > print "Content-type: text/plain\n\n";
    > $SIG{__WARN__} = sub { print @_ };
    > $SIG{__DIE__} = sub { print @_ unless $^S };
    > }
    > my $data = get("http://www.samplesite.org");
    >
    > open (CF2, "testtext.txt");

    Ahhhh.. that's (possibly) opening 'testtext.txt' for reading.


    > print CF2 "$data";

    Not writing.

    Add some error checking and open it for write.

    > close(CF2);
    > ################################################################################
    >
    > Like I said, it works on my computer but not on the server.

    Doubtful, unless 'works' means it doesn't create a file.
    J. Gleixner, Apr 1, 2011
    #11
  12. Rob

    Rob Guest

    On Apr 1, 3:09 pm, "J. Gleixner" <>
    wrote:

    >
    > Ahhhh.. that's (possibly) opening 'testtext.txt' for reading.


    I had deleted the path for the purposes of posting it to this forum
    and in the process deleted the ">" at the start, my mistake!

    > Doubtful, unless 'works' means it doesn't create a file.


    It is creating a file, but leaving it empty - I'm awaiting a response
    from my host now about their perl version.

    Rob
    Rob, Apr 1, 2011
    #12
  13. Rob

    J. Gleixner Guest

    Rob wrote:
    > On Apr 1, 3:09 pm, "J. Gleixner" <>
    > wrote:
    >
    >> Ahhhh.. that's (possibly) opening 'testtext.txt' for reading.

    >
    > I had deleted the path for the purposes of posting it to this forum
    > and in the process deleted the ">" at the start, my mistake!


    OK.. and what if the open failed???? Add some simple error checking!

    >
    >> Doubtful, unless 'works' means it doesn't create a file.

    >
    > It is creating a file, but leaving it empty - I'm awaiting a response
    > from my host now about their perl version.


    Why does your version of perl matter?

    Since you're saying it works when you run it, but not when executed
    as a CGI, then the first thing I'd look at is a permission
    problem. Back up a bit, and simplify your problem. Change your
    script to simply open the file you want to write to (for write),
    print something if open fails, and write something to it. That's all.
    Does that work?

    A very basic example:

    use CGI qw( header );
    print header;
    my $file = '/some/path/to/file';
    if ( open( my $fh, '>', $file ) )
    {
        print $fh 'Testing 123';
        close $fh;

        # Read the file back to confirm the write succeeded.
        open( my $o, '<', $file ) or die "opening $file for read failed: $!";
        print "content of $file: ", <$o>;
        close( $o );
    }
    else
    {
        print "opening $file for write failed: $!";
    }
    J. Gleixner, Apr 1, 2011
    #13
  14. Rob

    Willem Guest

    Rob wrote:
    ) On Mar 31, 8:50 pm, Willem <> wrote:
    )
    )> Dump LWP::Simple, and code it using LWP::UserAgent
    )>
    )> use LWP::UserAgent;
    )
    ) That has helped to show me that it is a matter of the request timing
    ) out. I imagine this may be because of a security feature which is not
    ) allowing a script within the site to download itself?

    Sounds unlikely. Especially if you're getting a timeout.
    Timeouts usually mean firewalls silently dropping packets.

    Have you tried using 'localhost' in place of the server name ?

    ) I have tried
    ) changing the IP address of the UserAgent using:
    )
    ) $ua->local_address("10.10.10.10");

    That's not likely to help, IMO.


    SaSW, Willem
    Willem, Apr 1, 2011
    #14
  15. Rob

    Rob Guest

    On Apr 1, 5:12 pm, "J. Gleixner" <>
    wrote:
    > OK.. and what if the open failed????  Add some simple error checking!


    The open didn't fail, it left me with an empty text file. The only
    error I get when running the script is a timeout.

    > Why does your version of perl matter?


    The version of Perl matters because I am trying to use LWP::UserAgent,
    and the version of LWP::UserAgent on my server does not apparently
    include the local_address function. I was hoping to use this as
    currently when I try to get a page (using LWP::UserAgent) it is timing
    out when running from the server, but working when I run from my own
    computer. As this is an indexing routine, I would like it to work from
    the server.

    > Back-up a bit, and simplify your problem. Change your
    > script to simply open the file you want to write to (for write),
    > print something if open fails, and write something to it. That's all.
    > Does that work?


    Thanks for the example but it is not a file permission problem, I have
    many scripts which read and write files on this server, none of which
    are problematic.

    Rob
    Rob, Apr 1, 2011
    #15
  16. Rob

    Jim Gibson Guest

    In article
    <>,
    Rob <> wrote:

    > On Mar 31, 8:50 pm, Willem <> wrote:
    >
    > but I am told that the method "local_address" can't be located through
    > the UserAgent package. After looking this up, it would appear that the
    > version of Perl is out of date on the server.


    Here is a program I stole many years ago to print out server Perl
    information. It probably depends upon the server being linux, but it
    may be possible to adapt it to other environments:


    #!/usr/bin/perl -T
    use strict;
    use File::Find;
    use File::Basename;

    my $debug;

    # untaint path
    $ENV{'PATH'} = '/usr/bin';

    # get perl version
    my $perlout = `perl -v`;
    my $perlver;
    if( $perlout =~ m/This is perl, v([\d.]+) built for/s ) {
        $perlver = $1;
    }


    print "Content-type: text/html\n\n";
    print "<HTML>\n";
    print "<HEAD><TITLE>Perl Environment</TITLE></HEAD>\n";
    print "<BODY>\n";
    print "<h1>Perl Version:</h1>\n";
    print "Perl version is $perlver<br>\n" if $perlver;
    print "<pre>\n";
    print "$perlout\n";
    print `perl -V`;
    print "</pre>\n";
    print "<H1>Perl Modules Installed:</H1>\n";
    my( %modules, %seen );
    my @subdirs = qw( i386-linux-thread-multi i686-linux );
    my( $dirlen, $curdir);
    for my $incdir ( @INC ) {
        $curdir = $incdir;
        $dirlen = length($incdir);
        print "\n<br>Look in $incdir ($dirlen):<br>\n\n" if $debug;
        find( {wanted=>\&add_module, no_chdir=>1}, $incdir);
    }

    print "<p><table border=1 cellspacing=2 cellpadding=4>\n";
    print "<tr><th>Module</th><th>Location</th></tr>\n";
    foreach my $file ( sort keys %modules ) {
        print "<tr><td>$file</td><td>$modules{$file}</td></tr>\n";
    }
    print "</table>\n";
    print "</BODY></HTML>\n";
    exit (0);

    sub add_module
    {
        # only include once
        return if $seen{$File::Find::name}++;

        # only include Perl modules ending with '.pm'
        return unless /\.pm$/;

        # eliminate unless belongs to active Perl version
        if( $perlver ) {
            return unless /$perlver/;
        }

        #return unless /site/;
        print " found $_<br>\n" if $debug;
        my $name;
        $name = substr($File::Find::name,$dirlen+1,-3);
        my $loc = substr($File::Find::name,0,$dirlen);
        print "name=$name, loc=$loc<br>\n" if $debug;
        $name =~ s/\//::/g;
        print " saving &quot;$name&quot; => &quot;$loc&quot;<br>\n"
            if $debug;
        $modules{$name} = $loc;
    }

    --
    Jim Gibson
    Jim Gibson, Apr 1, 2011
    #16
  17. Rob

    J. Gleixner Guest

    Rob wrote:
    > On Apr 1, 5:12 pm, "J. Gleixner" <>
    > wrote:
    >> OK.. and what if the open failed???? Add some simple error checking!

    >
    > The open didn't fail, it left me with an empty text file. The only
    > error I get when running the script is a timeout.
    >
    >> Why does your version of perl matter?

    >
    > The version of Perl matters because I am trying to use LWP::UserAgent,
    > and the version of LWP::UserAgent on my server does not apparently
    > include the local_address function. I was hoping to use this as
    > currently when I try to get a page (using LWP::UserAgent) it is timing
    > out when running from the server, but working when I run from my own
    > computer.


    Still the version of perl really doesn't matter. The version of
    that module might, however you can find that version yourself:

    use LWP::UserAgent;
    print $LWP::UserAgent::VERSION;

    or to see if a method 'can' be called:

    my $lwp = LWP::UserAgent->new();
    print "local_address is available." if $lwp->can( 'local_address' );


    > As this is an indexing routine, I would like it to work from
    > the server.




    No idea why you want to 'index' something through a CGI, but.... Do you
    have shell access to the server? If you do, then connect to that
    machine, using ssh/telnet/whatever, and do everything from that machine.
    Try 'telnet localhost 80'. Do you get a connection? I guess the server
    might not be allowing connections to port 80 from itself. If you do
    get a connection, then using LWP::Debug might help figure out the
    problem, or use the debugger and step through your program, to
    see what's happening, or not happening. There are a lot of possible
    problems, working with someone who owns the machine is probably your
    best bet.
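
    Without shell access, roughly the same check as 'telnet localhost 80'
    could be run from a CGI script. A minimal sketch, assuming only the
    core IO::Socket::INET module; the helper name can_connect and the
    five-second timeout are arbitrary choices for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;

# Try to open a TCP connection. Returns 1 on success, 0 on failure.
# On failure IO::Socket::INET leaves the reason in $@.
sub can_connect {
    my ( $host, $port, $timeout ) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => $timeout,
    );
    return 0 unless $sock;
    close $sock;
    return 1;
}

# Report the result to the browser, since no error log is available.
print "Content-type: text/plain\n\n";
if ( can_connect( 'localhost', 80, 5 ) ) {
    print "localhost:80 is reachable\n";
}
else {
    print "localhost:80 is not reachable: $@\n";
}
```

    A quick "connection refused" suggests nothing is listening on that
    address, while a failure that only arrives after the full timeout is
    more consistent with packets being dropped silently.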


    >
    >> Back-up a bit, and simplify your problem. Change your
    >> script to simply open the file you want to write to (for write),
    >> print something if open fails, and write something to it. That's all.
    >> Does that work?

    >
    > Thanks for the example but it is not a file permission problem, I have
    > many scripts which read and write files on this server, none of which
    > are problematic.


    OK. I haven't been following this very closely. That's usually a very
    common problem.
    J. Gleixner, Apr 1, 2011
    #17
  18. Rob <> writes:
    >> Why does your version of perl matter?

    >
    > The version of Perl matters because I am trying to use LWP::UserAgent,
    > and the version of LWP::UserAgent on my server does not apparently
    > include the local_address function.



    Would not, then, the version of LWP::UserAgent (and its ascendants) be
    a much more interesting datum?

    --L
    Lawrence Statton, Apr 1, 2011
    #18
  19. Rob <> wrote:
    >I am now able to 'get' a page when I run the script from my own
    >computer - it successfully downloads the page and I can do what I like
    >with the data.


    Great! This means your Perl problem is solved.

    Anything else is something else, like e.g. web server config issues,
    missing modules, incorrect use of CGI, ... that list goes on and on.

    jue
    Jürgen Exner, Apr 2, 2011
    #19
  20. Rob

    Rob Guest

    On Apr 2, 3:59 am, Jürgen Exner <> wrote:

    > Great! This means your Perl problem is solved.


    Thank you all for your advice and help with this. It turned out that
    the real problem was the firewall on the server: I can't do any HTTP
    communication using scripts within the site. I have overcome this by
    running the indexer on my own machine and uploading the indexed
    database each time, which now works fine.
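
    A rough sketch of that workflow, for anyone with the same restriction.
    It assumes the page text has already been fetched on the local machine.
    The build_index helper, the sample pages, and the search_index.db file
    name are made up for illustration. Storable's nstore is used so that
    the file is written in a byte order that does not depend on the machine
    that created it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;
use Storable qw(nstore retrieve);

# Build an inverted index: lowercased word => { url => occurrence count }.
sub build_index {
    my (%pages) = @_;    # url => page text
    my %index;
    while ( my ( $url, $text ) = each %pages ) {
        $text =~ s/<[^>]*>/ /g;    # crude tag stripping, not a real HTML parser
        $index{ lc $1 }{$url}++ while $text =~ /(\w+)/g;
    }
    return \%index;
}

# Made-up sample content standing in for pages fetched locally.
my %pages = (
    'viewpage.cgi?id=100' => '<h1>Perl search</h1> indexing dynamic pages',
    'viewpage.cgi?id=101' => '<p>Static pages and Perl</p>',
);

# Hypothetical output location; this is the file that would be uploaded.
my $file = File::Spec->catfile( File::Spec->tmpdir, 'search_index.db' );

nstore( build_index(%pages), $file );

# What the search script on the server would do: load and look up a word.
my $loaded = retrieve($file);
print join( ', ', sort keys %{ $loaded->{perl} } ), "\n";
```

    The search CGI on the server then only needs retrieve() to load the
    uploaded file, with no outgoing HTTP requests at all.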

    Rob
    Rob, Apr 8, 2011
    #20