recursively pull web site?

Mike

For the purpose of mirroring files on many nodes, I have set the files
under a web server and want to pull from the top of the structure
down, getting all files. Is there an easy way to pass the beginning
URL (http://server/RPMS/) to a method and have it pull
all files to the local node (keeping the directory structure from
the web server)?

This is to propagate configuration files, scripts, etc., to many boxes.

Mike
 
Mike

Michael Borgwardt said:
Runtime.getRuntime().exec("wget -m "+url);

Assuming, of course, that all the files you need are listed on HTML pages
(possibly directory listings generated by the server) reachable from
that first one.

And assuming that wget is installed on all my servers (Unix, Intel,
mainframe, etc.).
 
Michael Borgwardt

Mike said:
For the purpose of mirroring files on many nodes, I have set the files
under a web server and want to pull from the top of the structure
down, getting all files. Is there an easy way to pass the beginning
URL (http://server/RPMS/) to a method and have it pull
all files to the local node (keeping the directory structure from
the web server)?

Runtime.getRuntime().exec("wget -m "+url);

Assuming, of course, that all the files you need are listed on HTML pages
(possibly directory listings generated by the server) reachable from
that first one.
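
In case it's useful, a slightly fuller sketch of that call (class name
hypothetical; assumes wget is installed and on the PATH) would wait for
wget to finish and drain its output so the process can't block:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class WgetMirror {
    // Sketch only: shell out to wget to mirror everything under the URL.
    static void mirror(String url) throws Exception {
        Process p = Runtime.getRuntime().exec(new String[] { "wget", "-m", url });
        // wget reports progress on stderr; drain it so the pipe can't fill up.
        BufferedReader r = new BufferedReader(new InputStreamReader(p.getErrorStream()));
        for (String line; (line = r.readLine()) != null; )
            System.out.println(line);
        if (p.waitFor() != 0)
            throw new RuntimeException("wget failed for " + url);
    }

    public static void main(String[] args) throws Exception {
        mirror("http://server/RPMS/");
    }
}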
 
Andy Fish

You need a web spider or web robot.

Since you're asking here, I presume you want a Java one. I have used JoBo,
which is free and seems to work OK, but many others are available.

Whether it's an appropriate mechanism for mirroring software is another
question - I would probably prefer to tar/zip it up and then FTP it around

Andy
 
Michael Borgwardt

Mike said:
And assuming that wget is installed on all my servers (Unix, Intel,
mainframe, etc.).

Isn't it? :)

I'm pretty sure it would be less work than programming a web spider of your
own, but if there's one already done in Java, that's of course even better.
 
Mike

Andy said:
You need a web spider or web robot.

Since you're asking here, I presume you want a Java one. I have used JoBo,
which is free and seems to work OK, but many others are available.

Whether it's an appropriate mechanism for mirroring software is another
question - I would probably prefer to tar/zip it up and then FTP it around

Andy

The tar.gz solution is fine for lots of things, but not incremental
changes to a production server farm. For major application changes,
even utility changes (sudo, lsof, cvs, etc.), I will use rpm since
I can get source for it and can compile it everywhere.

Thanks for the suggestions.

Mike
 
Roedy Green

Andy said:
You need a web spider or web robot.

Xenu is very quick at spidering and will produce reports on what it
found, including broken links and orphaned files.

You would take its output and feed it into a mindless little program
that just downloads the files it found, one after another.

Xenu is quick because it uses many threads.

You could smarten your own beast up a bit by using several download
threads, each feeding off a common queue.

See http://mindprod.com/projects/brokenlinkfixer.html
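
A bare-bones version of that queue idea (class name hypothetical, error
handling minimal) might look like:

import java.io.InputStream;
import java.net.URL;
import java.nio.file.*;
import java.util.Arrays;
import java.util.concurrent.*;

// Sketch: several download threads feeding off a common queue of URLs,
// e.g. the list of files a spider reported.
public class QueueDownloader {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(Arrays.asList(args));
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++)
            pool.execute(() -> {
                for (String url; (url = queue.poll()) != null; ) {
                    try (InputStream in = new URL(url).openStream()) {
                        // crude local name: the last component of the URL path
                        String name = url.substring(url.lastIndexOf('/') + 1);
                        Files.copy(in, Paths.get(name), StandardCopyOption.REPLACE_EXISTING);
                    } catch (Exception e) {
                        System.err.println(url + ": " + e);
                    }
                }
            });
        pool.shutdown();
    }
}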
 
Mike

Roedy said:
Xenu is very quick at spidering and will produce reports on what it
found, including broken links and orphaned files.

You would take its output and feed it into a mindless little program
that just downloads the files it found, one after another.

Xenu is quick because it uses many threads.

You could smarten your own beast up a bit by using several download
threads, each feeding off a common queue.

See http://mindprod.com/projects/brokenlinkfixer.html

Thanks, Roedy. I enjoy reading your posts. I'll look at Xenu.

Mike
 
Mike

This might serve as an example to get you started...
<http://groups.google.com/groups?as_q=PullUrl3 koran.html>

Fantastic, thanks. A simple solution occurred to me while driving
home. Since I want to replicate my own files from my own web server,
the script that keeps the files current from the CVS repository
can easily do a 'find . -type f -print > files'. My program first
pulls the file 'files', then iterates through its contents, pulling
each file mentioned.
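
Roughly, and only a sketch (base URL and class name hypothetical):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.file.*;

// Sketch: fetch the 'files' manifest, then pull each file it lists,
// recreating the directory structure locally.
public class ManifestPuller {
    public static void main(String[] args) throws Exception {
        String base = "http://server/RPMS/";
        try (BufferedReader manifest = new BufferedReader(
                new InputStreamReader(new URL(base + "files").openStream()))) {
            for (String line; (line = manifest.readLine()) != null; ) {
                // find prints paths like ./scripts/backup.sh
                String rel = line.startsWith("./") ? line.substring(2) : line;
                Path local = Paths.get(rel);
                if (local.getParent() != null)
                    Files.createDirectories(local.getParent());
                try (InputStream in = new URL(base + rel).openStream()) {
                    Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}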

Thanks for the help, everyone.

Mike
 
Thomas Weidenfeller

Andy said:
Whether it's an appropriate mechanism for mirroring software is another
question

rsync, or if nothing else, rdist. No need to re-invent the wheel.

/Thomas
 
Roedy Green

Thomas said:
rsync, or if nothing else, rdist. No need to re-invent the wheel.

Rsync has two problems. It requires the Rsync software to run on a
webserver, and it uses its own protocol that you have to arrange to tunnel
through firewalls. In some bureaucratic situations these can be
showstoppers.

I argued for Rsync with researchers from pharmaceutical
companies. They insisted I write something without these two
drawbacks. The result is called The Replicator. See
http://mindprod.com/zips/java/replicator.html
 
Thomas Weidenfeller

Roedy said:
Rsync has two problems. It requires the Rsync software to run on a
webserver,

No, a remote shell account (ssh would be best, rsh would also do) is
all you need on the remote system. You could set up a remote sync
server, but you don't have to.
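
For example (paths hypothetical), a typical pull over ssh needs no
daemon on the remote side:

rsync -az -e ssh server:/var/www/RPMS/ /local/RPMS/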
Roedy said:
and it uses its own protocol

No, it uses the shell's protocol, unless you use a sync server.

Roedy said:
that you have to arrange to tunnel through firewalls.

Well, if you have to remote-administer a webserver, ssh is usually
permitted.

Roedy said:
I argued for Rsync with researchers from pharmaceutical companies.

If you told them the same as you told us here, well, yes, I can
understand why they argued.

/Thomas
 
Roedy Green

Thomas said:
No, a remote shell account (ssh would be best, rsh would also do) is
all you need on the remote system. You could set up a remote sync
server, but you don't have to.

Even that can be hard to arrange. Most ISPs won't let you run any
software at all. All you get is a vanilla HTTP server running none of
your code. Life in a big company can be almost as restrictive.
 
Roedy Green

Thomas said:
If you told them the same as you told us here, well, yes, I can
understand why they argued.

They could not run any code at all on their webservers. Further, all
clients were deeply behind a variety of corporate firewalls with no
political clout to get anything modified.

All it would take to spoil the project would be one client unable
to communicate.

I argued for running on a third-party server, but that too was not
permitted. Part of the problem was the high confidentiality involved.
 
