extract list from webpage

plb

Hello All,

I'm just starting to learn Perl. Here's what I'm trying to do:

I have a web page set up that displays all 'active' vhosts that run on
my webserver. This page is generated in ASP. I would like to write a
Perl script that takes the list of 'active' vhosts and writes them to
a text file. Then I would like to compare this list to the wwwroot
directory on the webserver so I can delete any vhosts that are not in
use. The webpage and the wwwroot directory differ by about 300-400
vhosts.

Ideally I would like the script to:
1) compile a list of active vhosts
2) compare this list to the "wwwroot" directory
3) delete any directory that is not on the list of active vhosts

I'm hoping someone can help me out getting started.

Thank you for your help,
plb
 
Zak McGregor

Well, without some specific information regarding the format of the
information you are trying to parse, it is hard to make a good
recommendation. Have you looked at HTML::Parser?

Sounds like s/he's actually trying to parse the httpd configuration
files. There is probably a module to do that, check CPAN.
http://search.cpan.org/

Ciao

Zak
 
Zak McGregor

3) delete any directory that is not on the list of active vhosts

May I humbly suggest not doing any actual deleting programmatically until
you're sure that your program works as expected _and_ you've fed it
unexpected input. The first few times you test, print the commands
that would have been executed instead, so that you can make sure it is
not doing anything unexpected.

Also, don't run this script with too liberal permissions. If you unlink
things as root for instance, you're tangoing with disaster.
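
The print-first suggestion above could be sketched roughly like this. It is only an illustration, not a tested cleanup tool: the sub name remove_stale, its arguments, and the "would remove" wording are made up for the example, and it assumes each vhost is a directory directly under the wwwroot path. It uses only core modules (File::Path, File::Spec).

```perl
use strict;
use warnings;
use File::Path qw(remove_tree);
use File::Spec;

# Hypothetical helper: handle each name in @$stale under $root.
# With $dry_run true, nothing is deleted; the actions are only reported.
# Returns the list of log lines so the caller can print or inspect them.
sub remove_stale {
    my ($root, $stale, $dry_run) = @_;
    my @log;
    for my $name (@$stale) {
        # Refuse unexpected input: empty names, "." and "..", or anything
        # containing a path separator could point outside $root.
        next if !defined $name || $name eq '';
        next if $name =~ m{[/\\]} || $name =~ /^\.\.?$/;

        my $path = File::Spec->catdir($root, $name);
        next unless -d $path;    # only ever touch existing directories

        if ($dry_run) {
            push @log, "would remove: $path";
        }
        else {
            remove_tree($path);
            push @log, "removed: $path";
        }
    }
    return @log;
}
```

Calling print "$_\n" for remove_stale($root, \@stale, 1); lists what would go without touching anything. Only once that output looks right would you pass 0 as the last argument, and preferably not as root.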

Ciao

Zak
 
A. Sinan Unur

Sounds like s/he's actually trying to parse the httpd configuration
files. There is probably a module to do that, check CPAN.
http://search.cpan.org/

Well, the problem description does mention the information being in a web
page generated using ASP, and part of the objective seems to be to
extract the list of 'active' vhosts from that page.
 
Bob Walton

plb wrote:

....
I have a webpage setup that displays all 'active' vhosts that run on
my webserver. This page is generated in ASP. I would like to write a
perl script that takes the list of 'active' vhosts and writes them to
a text file. Then I would like to compare this list to the wwwroot
directory on the webserver so I can delete any vhosts that are not in
use. There is about a 300-400 difference between the webpage and
wwwroot directory.

Ideally I would like the script to:
1) compile a list of active vhosts
2) compare this list to the "wwwroot" directory
3) delete any directory that is not on the list of active vhosts ....


plb


Modules are your friend. You may

use LWP::Simple;

to get() your web page. It is very simple:

use LWP::Simple;
use warnings;
use strict;
my $webpage=get('http://some.server.com/some/path...');

Then you'll need to parse the HTML to retrieve the info you want:

use HTML::Parser;

might help with that. Warning: Parsing HTML is much harder than it
looks at first glance -- that's why HTML::Parser is there. Good luck.
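
As a rough idea of what the HTML::Parser step might look like, here is a minimal sketch. It assumes the vhost names sit in table cells (<td> elements), which the thread does not confirm, and the sub name table_cells is invented for the example. HTML::Parser is a CPAN module, not part of core Perl, so it has to be installed first.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the text of every <td> cell in an HTML string.
# Assumption: the data of interest is in table cells.
sub table_cells {
    my ($html) = @_;
    my @cells;
    my $in_td = 0;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'tagname' is passed lower-cased, so <TD> and <td> both match.
        start_h => [
            sub {
                my ($tag) = @_;
                if ($tag eq 'td') { $in_td = 1; push @cells, ''; }
            },
            'tagname',
        ],
        end_h => [ sub { $in_td = 0 if $_[0] eq 'td' }, 'tagname' ],
        # 'dtext' is the text with HTML entities decoded; it may arrive
        # in several chunks, so append rather than assign.
        text_h => [ sub { $cells[-1] .= $_[0] if $in_td }, 'dtext' ],
    );

    $parser->parse($html);
    $parser->eof;

    s/^\s+|\s+$//g for @cells;    # trim surrounding whitespace
    return @cells;
}
```

With the page from get() in $webpage, my @cells = table_cells($webpage); would give one string per cell, which still has to be filtered down to the cells that actually hold vhost names.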
 
plb

Bob Walton said:
plb wrote:

...


Modules are your friend. You may

use LWP::Simple;

to get() your web page. It is very simple:

use LWP::Simple;
use warnings;
use strict;
my $webpage=get('http://some.server.com/some/path...');

Then you'll need to parse the HTML to retrieve the info you want:

use HTML::Parser;

might help with that. Warning: Parsing HTML is much harder than it
looks at first glance -- that's why HTML::Parser is there. Good luck.


Okay, I've been playing around with this for a while.

The webpage I'm trying to access requires a username and password to
login. I tried using some code like this:

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
my $req = HTTP::Request->new(GET =>
'http://webadmin.foo.com/admin/VhostListSelect.asp');
$req->authorization_basic('username', 'password');
print $ua->request($req)->as_string;

However, that doesn't seem to work. I run the script and I get this:

P:\>perl vhosts.pl
HTTP/1.1 200 OK
Cache-Control: private
Connection: Keep-Alive
Date: Fri, 25 Jul 2003 19:37:10 GMT
Server: Microsoft-IIS/5.0
Content-Length: 169
Content-Type: text/html
Client-Date: Fri, 25 Jul 2003 19:37:11 GMT
Client-Peer: 10.50.11.151:80
Set-Cookie: ASPSESSIONIDGGGGGBRY=GFHCEJJCMFDBMCBIHNCPDBJL; path=/

<HTML>
<BODY>
<CENTER><B><FONT FACE = "Arial" SIZE = "3">Access Violation<br>You do not have access to this functionality.</FONT></B></CENTER>
</BODY>
</HTML>


I've looked at using HTML::Parser, and it's pretty gross...at least
for someone who isn't very good with Perl.

What I was thinking I could do is this: save the results from the ASP
page to an HTML file, open that file, and parse it with some sort of
regexp to get all the vhosts and dump that info into an array. Then
create an array with all the info from the wwwroot directory and compare
the two using a hash. That should do it. Does that sound reasonable, or
is there an easier way?
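
For what it's worth, the plan above could be sketched along these lines, using only core Perl. Several things here are guesses rather than facts from the thread: the sub names are invented, the regexp assumes the vhosts appear in the page as dotted host names, and the comparison assumes each wwwroot directory is named exactly after its vhost. The pattern will also pick up things like page.asp or host names inside links, so it would need tightening against the real page.

```perl
use strict;
use warnings;

# Hypothetical helper: pull host-like names (foo.example.com) out of
# saved HTML. Returns a sorted, lower-cased list without duplicates.
sub active_vhosts {
    my ($html) = @_;
    my %seen;
    $seen{ lc $1 } = 1
        while $html =~ /\b((?:[a-z0-9-]+\.)+[a-z]{2,})\b/gi;
    return sort keys %seen;
}

# Hypothetical helper: names of the directories directly under $root.
sub wwwroot_dirs {
    my ($root) = @_;
    opendir(my $dh, $root) or die "Cannot open $root: $!";
    my @dirs = grep { !/^\.\.?$/ && -d "$root/$_" } readdir $dh;
    closedir $dh;
    return sort @dirs;
}

# Directories that are not in the active list. The hash gives a quick
# "is this name active?" lookup; names are compared case-insensitively.
sub unused_dirs {
    my ($active, $dirs) = @_;
    my %is_active = map { ( lc($_) => 1 ) } @$active;
    return grep { !$is_active{ lc $_ } } @$dirs;
}
```

After reading the saved file into $html, something like my @active = active_vhosts($html); my @unused = unused_dirs(\@active, [ wwwroot_dirs($root) ]); would give the candidates, which are best printed and checked by eye before anything is deleted.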

I'm a little lost with all this.

Thanks for all the help,
plb
 
