recursively pull web site?

Discussion in 'Java' started by Mike, Jun 23, 2004.

  1. Mike

    Mike Guest

    For the purpose of mirroring files on many nodes, I have set the files
    under a web server and want to pull from the top of the structure
    down getting all files. Is there an easy way to pass the beginning
    url (http://server/RPMS/) to a method and have that method(s) pull
    all files to the local node (keeping the directory structure from
    the web server)?

    This is to propagate configuration files, scripts, etc., to many boxes.

    Mike
     
    Mike, Jun 23, 2004
    #1

  2. Mike

    Mike Guest

    In article <>, Michael Borgwardt wrote:
    > Mike wrote:
    >
    >> For the purpose of mirroring files on many nodes, I have set the files
    >> under a web server and want to pull from the top of the structure
    >> down getting all files. Is there an easy way to pass the beginning
    >> url (http://server/RPMS/) to a method and have that method(s) pull
    >> all files to the local node (keeping the directory structure from
    >> the web server)?

    >
    > Runtime.getRuntime().exec("wget -m "+url);
    >
    > Assuming, of course, that all the files you need are listed on HTML pages
    > (possibly directory listings generated by the server) reachable from
    > that first one.


    And assuming that wget is installed on all my servers (unix, intel,
    mainframe, etc.).
     
    Mike, Jun 23, 2004
    #2

  3. Mike wrote:

    > For the purpose of mirroring files on many nodes, I have set the files
    > under a web server and want to pull from the top of the structure
    > down getting all files. Is there an easy way to pass the beginning
    > url (http://server/RPMS/) to a method and have that method(s) pull
    > all files to the local node (keeping the directory structure from
    > the web server)?


    Runtime.getRuntime().exec("wget -m "+url);

    Assuming, of course, that all the files you need are listed on HTML pages
    (possibly directory listings generated by the server) reachable from
    that first one.
     
    Michael Borgwardt, Jun 23, 2004
    #3
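
    If wget is available, Michael's one-liner can be made a little more
    robust by passing the arguments as an array and waiting for the exit
    status. A minimal sketch, assuming wget is installed and on the PATH
    (the class and method names here are illustrative only):

        import java.io.IOException;

        class MirrorWithWget {

            /** Mirror the given URL with wget and return its exit status. */
            static int mirror(String url) throws IOException, InterruptedException {
                // -q keeps wget quiet so its output pipes cannot fill up and
                // stall the child process; arguments are passed separately so
                // the URL is not re-tokenized.
                String[] cmd = { "wget", "-q", "-m", url };
                Process p = Runtime.getRuntime().exec(cmd);
                return p.waitFor();          // 0 means wget reported success
            }
        }
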
  4. Mike

    Andy Fish Guest

    you need a web spider or web robot.

    since you're asking in here, I presume you want a java one. I have used jobo
    which is free and seems to work OK but many others are available.

    Whether it's an appropriate mechanism for mirroring software is another
    question - I would probably prefer to tar/zip it up and then FTP it around

    Andy


    "Mike" <> wrote in message
    news:...
    > For the purpose of mirroring files on many nodes, I have set the files
    > under a web server and want to pull from the top of the structure
    > down getting all files. Is there an easy way to pass the beginning
    > url (http://server/RPMS/) to a method and have that method(s) pull
    > all files to the local node (keeping the directory structure from
    > the web server)?
    >
    > This is to propagate configuration files, scripts, etc., to many boxes.
    >
    > Mike
     
    Andy Fish, Jun 23, 2004
    #4
  5. Mike wrote:
    >>Runtime.getRuntime().exec("wget -m "+url);
    >>
    >>Assuming, of course, that all the files you need are listed on HTML pages
    >>(possibly directory listings generated by the server) reachable from
    >>that first one.

    >
    >
    > And assuming that wget is installed on all my servers (unix, intel,
    > mainframe, etc.).


    Isn't it? :)

    I'm pretty sure it would be less work than programming a web spider of your
    own, but if there's one already done in java, that's of course even better.
     
    Michael Borgwardt, Jun 23, 2004
    #5
  6. Mike

    Mike Guest

    In article <7kiCc.779$>, Andy Fish wrote:
    > you need a web spider or web robot.
    >
    > since you're asking in here, I presume you want a java one. I have used jobo
    > which is free and seems to work OK but many others are available.
    >
    > Whether it's an appropriate mechanism for mirroring software is another
    > question - I would probably prefer to tar/zip it up and then FTP it around
    >
    > Andy
    >
    >
    > "Mike" <> wrote in message
    > news:...
    >> For the purpose of mirroring files on many nodes, I have set the files
    >> under a web server and want to pull from the top of the structure
    >> down getting all files. Is there an easy way to pass the beginning
    >> url (http://server/RPMS/) to a method and have that method(s) pull
    >> all files to the local node (keeping the directory structure from
    >> the web server)?
    >>
    >> This is to propagate configuration files, scripts, etc., to many boxes.
    >>
    >> Mike

    >
    >


    The tar.gz solution is fine for lots of things, but not incremental
    changes to a production server farm. For major application changes,
    even utility changes (sudo, lsof, cvs, etc.), I will use rpm since
    I can get source for it and can compile it everywhere.

    Thanks for the suggestions.

    Mike
     
    Mike, Jun 23, 2004
    #6
  7. Mike

    Roedy Green Guest

    On Wed, 23 Jun 2004 16:36:19 GMT, "Andy Fish"
    <> wrote or quoted :

    >you need a web spider or web robot.


    Xenu is very quick at spidering and will produce reports on what it
    found including broken links and orphaned files.

    you would take its output and feed that into a mindless little program
    that just downloaded the files it found one after another.

    Xenu is quick because it uses many threads.

    You could smarten your own beast up a bit by using several download
    threads each feeding off a common queue.


    See http://mindprod.com/projects/brokenlinkfixer.html

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jun 23, 2004
    #7
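
    Roedy's suggestion of several download threads feeding off a common
    queue might be sketched roughly as follows. This is only an outline of
    that idea, not code from Xenu or the broken-link fixer, and
    downloadFile() is a placeholder for whatever actually fetches and
    saves one file:

        import java.util.LinkedList;
        import java.util.List;

        class QueueDownloader {

            private final LinkedList queue = new LinkedList();   // of String URLs

            QueueDownloader(List urls) {
                queue.addAll(urls);
            }

            /** Hand out the next URL, or null once the queue is empty. */
            synchronized String next() {
                return queue.isEmpty() ? null : (String) queue.removeFirst();
            }

            /** Start the given number of worker threads and wait for them. */
            void run(int threads) throws InterruptedException {
                Thread[] workers = new Thread[threads];
                for (int i = 0; i < threads; i++) {
                    workers[i] = new Thread(new Runnable() {
                        public void run() {
                            String url;
                            while ((url = next()) != null) {
                                downloadFile(url);    // each worker drains the shared queue
                            }
                        }
                    });
                    workers[i].start();
                }
                for (int i = 0; i < threads; i++) {
                    workers[i].join();                // wait until every download is done
                }
            }

            void downloadFile(String url) {
                // placeholder: fetch the URL and save it under the local mirror root
            }
        }
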
  8. Mike

    Mike Guest

    In article <>, Roedy Green wrote:
    > On Wed, 23 Jun 2004 16:36:19 GMT, "Andy Fish"
    ><> wrote or quoted :
    >
    >>you need a web spider or web robot.

    >
    > Xenu is very quick at spidering and will produce reports on what it
    > found including broken links and orphaned files.
    >
    > you would take its output and feed that into a mindless little program
    > that just downloaded the files it found one after another.
    >
    > Xenu is quick because it uses many threads.
    >
    > You could smarten your own beast up a bit by using several download
    > threads each feeding off a common queue.
    >
    >
    > See http://mindprod.com/projects/brokenlinkfixer.html
    >


    Thanks, Roedy. I enjoy reading your posts. I'll look at Xenu.

    Mike
     
    Mike, Jun 23, 2004
    #8
  9. On Wed, 23 Jun 2004 15:39:39 -0000, Mike wrote:

    > Is there an easy way to pass the beginning
    > url (http://server/RPMS/) to a method and have that method(s) pull
    > all files to the local node (keeping the directory structure from
    > the web server)?


    This might serve as an example to get you started..
    <http://groups.google.com/groups?as_q=PullUrl3%20koran.html>

    --
    Andrew Thompson
    http://www.PhySci.org/ Open-source software suite
    http://www.PhySci.org/codes/ Web & IT Help
    http://www.1point1C.org/ Science & Technology
     
    Andrew Thompson, Jun 23, 2004
    #9
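
    In the same spirit as the example Andrew points to, a minimal sketch
    of the single-file download (keeping the server's directory structure,
    as the original post asks) might look like this; the class and method
    names are illustrative, not from the thread:

        import java.io.*;
        import java.net.URL;

        class Pull {

            /** Copy one file from the web server into localRoot, recreating
                the path below the base URL (e.g. subdir/name.rpm). */
            static void pullOne(String baseUrl, String relativePath, File localRoot)
                    throws IOException {
                File target = new File(localRoot, relativePath);
                target.getParentFile().mkdirs();     // keep the directory structure

                InputStream in = new URL(baseUrl + relativePath).openStream();
                OutputStream out = new FileOutputStream(target);
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                out.close();
                in.close();
            }
        }
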
  10. Mike

    Mike Guest

    In article <vonwkqe5mkzg$>, Andrew Thompson wrote:
    > On Wed, 23 Jun 2004 15:39:39 -0000, Mike wrote:
    >
    >> Is there an easy way to pass the beginning
    >> url (http://server/RPMS/) to a method and have that method(s) pull
    >> all files to the local node (keeping the directory structure from
    >> the web server)?

    >
    > This might serve as an example to get you started..
    ><http://groups.google.com/groups?as_q=PullUrl3%20koran.html>
    >


    Fantastic, thanks. A simple solution occurred to me while driving
    home. Since I want to replicate my own files from my own web server,
    the script that keeps the files current from the CVS repository
    can easily do a 'find . -type f -print > files'. My program first
    pulls the file 'files', then iterates through its contents, pulling
    each file mentioned.

    Thanks for the help, everyone.

    Mike
     
    Mike, Jun 24, 2004
    #10
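
    Mike's manifest approach could be wired up roughly like this. It
    reuses the hypothetical Pull.pullOne() helper sketched above; the
    base URL is the example from the thread and the local directory name
    is illustrative:

        import java.io.*;
        import java.net.URL;

        class PullFromManifest {

            public static void main(String[] args) throws IOException {
                String base = "http://server/RPMS/";   // starting URL from the thread
                File localRoot = new File("mirror");   // illustrative local mirror root

                // First fetch the 'files' manifest written on the server by
                // 'find . -type f -print > files', then pull each entry.
                BufferedReader manifest = new BufferedReader(
                        new InputStreamReader(new URL(base + "files").openStream()));
                String line;
                while ((line = manifest.readLine()) != null) {
                    // find prints paths such as ./subdir/name; strip the leading ./
                    String path = line.startsWith("./") ? line.substring(2) : line;
                    Pull.pullOne(base, path, localRoot);
                }
                manifest.close();
            }
        }
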
  11. Andy Fish wrote:
    > Whether it's an appropriate mechanism for mirroring software is another
    > question


    rsync or if nothing else, rdist. No need to re-invent the wheel.

    /Thomas
     
    Thomas Weidenfeller, Jun 24, 2004
    #11
  12. Mike

    Roedy Green Guest

    On Thu, 24 Jun 2004 08:54:38 +0200, Thomas Weidenfeller
    <> wrote or quoted :

    >
    >rsync or if nothing else, rdist. No need to re-invent the wheel.


    Rsync has two problems. It requires the Rsync software to run on a
    webserver, and it uses its own protocol you have to arrange to tunnel
    through firewalls. In some bureaucratic situations these can be
    showstoppers.

    I argued for Rsync with researchers from the pharmaceutical
    companies. They insisted I write something without these two
    drawbacks. The result is called The Replicator. See
    http://mindprod.com/zips/java/replicator.html

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jun 24, 2004
    #12
  13. Roedy Green wrote:
    > Rsync has two problems. It requires the Rsync software to run on a
    > webserver,


    No, a remote shell account (ssh would be best, rsh would also do) is
    all you need on the remote system. You could set up a remote sync
    server, but you don't have to.

    > and it uses its own protocol


    No, it uses the shell's protocol, unless you use a sync server.

    > you have to arrange to tunnel
    > through firewalls.


    Well, if you have to remote administer a webserver, ssh is usually
    permitted.

    > I argued for Rsync with researchers from the pharmaceutical
    > companies.


    If you told them the same as you told us here, well, yes, I can
    understand why they argued.

    /Thomas
     
    Thomas Weidenfeller, Jun 28, 2004
    #13
  14. Mike

    Roedy Green Guest

    On Mon, 28 Jun 2004 10:12:00 +0200, Thomas Weidenfeller
    <> wrote or quoted :

    >No, a remote shell account (ssh would be best, rsh would also do) is
    >all you need on the remote system. You could set up a remote sync
    >server, but you don't have to.


    Even that can be hard to arrange. Most ISPs won't let you run any
    software at all. All you get is a vanilla HTTP server running none of
    your code. Life in a big company can be almost as restrictive.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jun 29, 2004
    #14
  15. Mike

    Roedy Green Guest

    On Mon, 28 Jun 2004 10:12:00 +0200, Thomas Weidenfeller
    <> wrote or quoted :

    >If you told them the same as you told us here, well, yes, I can
    >understand why they argued.


    They could not run any code at all on their webservers. Further, all
    clients were deeply behind a variety of corporate firewalls with no
    political clout to get anything modified.

    All it would have taken to spoil the project was a single client
    that could not communicate.

    I argued for running on a third party server, but that too was not
    permitted. Part of the problem was the high confidentiality.



    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jun 29, 2004
    #15
  16. Mike

    Roedy Green Guest

    On Mon, 28 Jun 2004 10:12:00 +0200, Thomas Weidenfeller
    <> wrote or quoted :

    >If you told them the same as you told us here, well, yes, I can
    >understand why they argued.


    You might want to check the rsync Java glossary entry for accuracy.

    See http://mindprod.com/jgloss/rsync.html


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jun 29, 2004
    #16