recursively pull web site?

Mike

For the purpose of mirroring files on many nodes, I have set the files
under a web server and want to pull from the top of the structure
down, getting all files. Is there an easy way to pass the beginning
URL (http://server/RPMS/) to a method and have it pull
all files to the local node (keeping the directory structure from
the web server)?

This is to propagate configuration files, scripts, etc., to many boxes.

Mike
 
Mike

Michael Borgwardt said:
Runtime.getRuntime().exec("wget -m "+url);

Assuming, of course, that all the files you need are listed on HTML pages
(possibly directory listings generated by the server) reachable from
that first one.

And assuming that wget is installed on all my servers (Unix, Intel,
mainframe, etc.).
 
Michael Borgwardt

Mike said:
For the purpose of mirroring files on many nodes, I have set the files
under a web server and want to pull from the top of the structure
down, getting all files. Is there an easy way to pass the beginning
URL (http://server/RPMS/) to a method and have it pull
all files to the local node (keeping the directory structure from
the web server)?

Runtime.getRuntime().exec("wget -m "+url);

Assuming, of course, that all the files you need are listed on HTML pages
(possibly directory listings generated by the server) reachable from
that first one.
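
In case it's useful, a slightly fuller sketch of that call (class name
hypothetical; assumes wget is installed and on the PATH) would wait for
wget to finish and drain its output so the process can't block:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class WgetMirror {
    // Sketch only: shell out to wget to mirror everything under the URL.
    static void mirror(String url) throws Exception {
        Process p = Runtime.getRuntime().exec(new String[] { "wget", "-m", url });
        // wget reports progress on stderr; drain it so the pipe can't fill up.
        BufferedReader r = new BufferedReader(new InputStreamReader(p.getErrorStream()));
        for (String line; (line = r.readLine()) != null; )
            System.out.println(line);
        if (p.waitFor() != 0)
            throw new RuntimeException("wget failed for " + url);
    }

    public static void main(String[] args) throws Exception {
        mirror("http://server/RPMS/");
    }
}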
 
Andy Fish

You need a web spider or web robot.

Since you're asking here, I presume you want a Java one. I have used JoBo,
which is free and seems to work OK, but many others are available.

Whether it's an appropriate mechanism for mirroring software is another
question - I would probably prefer to tar/zip it up and then FTP it around

Andy
 
Michael Borgwardt

Mike said:
And assuming that wget is installed on all my servers (Unix, Intel,
mainframe, etc.).

Isn't it? :)

I'm pretty sure it would be less work than programming a web spider of your
own, but if there's one already done in Java, that's of course even better.
 
Mike

Andy said:
You need a web spider or web robot.

Since you're asking here, I presume you want a Java one. I have used JoBo,
which is free and seems to work OK, but many others are available.

Whether it's an appropriate mechanism for mirroring software is another
question - I would probably prefer to tar/zip it up and then FTP it around

Andy

The tar.gz solution is fine for lots of things, but not incremental
changes to a production server farm. For major application changes,
even utility changes (sudo, lsof, cvs, etc.), I will use rpm since
I can get source for it and can compile it everywhere.

Thanks for the suggestions.

Mike
 
Roedy Green

Andy said:
You need a web spider or web robot.

Xenu is very quick at spidering and will produce reports on what it
found, including broken links and orphaned files.

You would take its output and feed it into a mindless little program
that just downloads the files it found, one after another.

Xenu is quick because it uses many threads.

You could smarten your own beast up a bit by using several download
threads, each feeding off a common queue.

See http://mindprod.com/projects/brokenlinkfixer.html
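
A bare-bones version of that queue idea (class name hypothetical, error
handling minimal) might look like:

import java.io.InputStream;
import java.net.URL;
import java.nio.file.*;
import java.util.Arrays;
import java.util.concurrent.*;

// Sketch: several download threads feeding off a common queue of URLs,
// e.g. the list of files a spider reported.
public class QueueDownloader {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(Arrays.asList(args));
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++)
            pool.execute(() -> {
                for (String url; (url = queue.poll()) != null; ) {
                    try (InputStream in = new URL(url).openStream()) {
                        // crude local name: the last component of the URL path
                        String name = url.substring(url.lastIndexOf('/') + 1);
                        Files.copy(in, Paths.get(name), StandardCopyOption.REPLACE_EXISTING);
                    } catch (Exception e) {
                        System.err.println(url + ": " + e);
                    }
                }
            });
        pool.shutdown();
    }
}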
 
Mike

Roedy said:
Xenu is very quick at spidering and will produce reports on what it
found, including broken links and orphaned files.

You would take its output and feed it into a mindless little program
that just downloads the files it found, one after another.

Xenu is quick because it uses many threads.

You could smarten your own beast up a bit by using several download
threads, each feeding off a common queue.

See http://mindprod.com/projects/brokenlinkfixer.html

Thanks, Roedy. I enjoy reading your posts. I'll look at Xenu.

Mike
 
Mike

This might serve as an example to get you started...
<http://groups.google.com/groups?as_q=PullUrl3 koran.html>

Fantastic, thanks. A simple solution occurred to me while driving
home. Since I want to replicate my own files from my own web server,
the script that keeps the files current from the CVS repository
can easily do a 'find . -type f -print > files'. My program first
pulls the file 'files', then iterates through its contents, pulling
each file mentioned.
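
Roughly, and only a sketch (base URL and class name hypothetical):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.file.*;

// Sketch: fetch the 'files' manifest, then pull each file it lists,
// recreating the directory structure locally.
public class ManifestPuller {
    public static void main(String[] args) throws Exception {
        String base = "http://server/RPMS/";
        try (BufferedReader manifest = new BufferedReader(
                new InputStreamReader(new URL(base + "files").openStream()))) {
            for (String line; (line = manifest.readLine()) != null; ) {
                // find prints paths like ./scripts/backup.sh
                String rel = line.startsWith("./") ? line.substring(2) : line;
                Path local = Paths.get(rel);
                if (local.getParent() != null)
                    Files.createDirectories(local.getParent());
                try (InputStream in = new URL(base + rel).openStream()) {
                    Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}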

Thanks for the help, everyone.

Mike
 
Thomas Weidenfeller

Andy said:
Whether it's an appropriate mechanism for mirroring software is another
question

rsync, or if nothing else, rdist. No need to re-invent the wheel.

/Thomas
 
Roedy Green

Thomas said:
rsync, or if nothing else, rdist. No need to re-invent the wheel.

Rsync has two problems. It requires the Rsync software to run on a
webserver, and it uses its own protocol that you have to arrange to tunnel
through firewalls. In some bureaucratic situations these can be
showstoppers.

I argued for Rsync with researchers from pharmaceutical
companies. They insisted I write something without these two
drawbacks. The result is called The Replicator. See
http://mindprod.com/zips/java/replicator.html
 
Thomas Weidenfeller

Roedy said:
Rsync has two problems. It requires the Rsync software to run on a
webserver,

No, a remote shell account (ssh would be best, rsh would also do) is
all you need on the remote system. You could set up a remote sync
server, but you don't have to.
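
For example (paths hypothetical), a typical pull over ssh needs no
daemon on the remote side:

rsync -az -e ssh server:/var/www/RPMS/ /local/RPMS/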
Roedy said:
and it uses its own protocol

No, it uses the shell's protocol, unless you use a sync server.

Roedy said:
that you have to arrange to tunnel through firewalls.

Well, if you have to remote-administer a webserver, ssh is usually
permitted.

Roedy said:
I argued for Rsync with researchers from pharmaceutical companies.

If you told them the same as you told us here, well, yes, I can
understand why they argued.

/Thomas
 
Roedy Green

Thomas said:
No, a remote shell account (ssh would be best, rsh would also do) is
all you need on the remote system. You could set up a remote sync
server, but you don't have to.

Even that can be hard to arrange. Most ISPs won't let you run any
software at all. All you get is a vanilla HTTP server running none of
your code. Life in a big company can be almost as restrictive.
 
Roedy Green

Thomas said:
If you told them the same as you told us here, well, yes, I can
understand why they argued.

They could not run any code at all on their webservers. Further, all
clients were deeply behind a variety of corporate firewalls with no
political clout to get anything modified.

All it would take to spoil the project would be one client unable
to communicate.

I argued for running on a third-party server, but that too was not
permitted. Part of the problem was the high confidentiality involved.
 
