Copying Website Contents, esp. Message Boards


cmashieldscapting

How would a Macintosh user best go about the following:

--Copying the entire contents of a message board composed of forums.
Each forum has threads that must be opened to see the messages, with
longer threads divided into pages.

--Copying messages from the Google version of Usenet, in which messages
are organized into threads, with the contents of a certain number of
messages visible per page.

--Copying messages from Yahoo! groups, which shows each individual
message in the order it came in. The messages appear on pages, and
each message has to be opened to be read.

What I especially need is a way to copy the entire contents of messages
in bulk without having to open each message--can one or more pages be
copied at a time in certain formats?--and to save them by backing them
up on CD, ideally in a form as close as possible to the way they appear
in their natural habitat. Of course I'd have to have a good idea of
what will fit on a CD, so a very large forum may need to be divided
across several CDs.

I am also interested in copying other website content, such as text,
pictures, sound, and moving images. Thanks for any help and advice.

Cori
 

ray

How would a Macintosh user best go about the following:

--Copying the entire contents of a message board composed of forums.
Each forum has threads that must be opened to see the messages, with
longer threads divided into pages.

As the pages you describe are generated as a result of queries to a
database, I'd say you can't. Without the query the page doesn't exist.

I am also interested in copying other website content, such as text,
pictures, sound, and moving images.

The contents of a web page are stored in the browser cache on your
computer; that's how you get to see it.
There are applications that enable you to download entire static
websites without actually visiting them. WebGrabber and SiteSucker come
to mind.
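
For illustration, the command-line equivalent of that kind of
whole-site grab, using wget. This is only a sketch: the URL is made
up, and wget may need to be installed separately on a Mac (e.g. via
Fink or MacPorts), unlike curl.

  # --recursive        follow links from the starting page
  # --level=3          but only three levels deep
  # --page-requisites  also fetch the images/CSS each page needs
  # --convert-links    rewrite links so the local copy works off-line
  # --wait=2           pause two seconds between requests, to be polite
  # --no-parent        never climb above the starting directory
  wget --recursive --level=3 --page-requisites --convert-links \
       --wait=2 --no-parent http://www.example.com/forum/

Whether this reaches the individual messages depends on the board: if
every thread and page is an ordinary link it can be followed, but
anything that only appears after a form post or a login cannot.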
 

Barbara de Zoete

There are applications that enable you to download entire static
websites without actually visiting them. WebGrabber and SiteSucker come
to mind.

I very much dislike seeing someone use those applications on my site.
Downloading over a hundred pages and all that comes with them in a short
time, for what? I'm sure the one who does that is not going to read all
the stuff that just got downloaded, which means it is just a waste of
bandwidth.
If I spot the same IP doing that more than once (yup, there are those
people) or if I notice that it is a commercial enterprise that does that,
the IP gets banned. I wish there was a way to block these grabbers
altogether.
 

Phil Earnhardt

As the pages you describe are generated as a result of queries to a
database, I'd say you can't. Without the query the page doesn't exist.

If the queries are wired into the HTML links of the pages you wish to
grab, the automated tools to recursively capture an entire website may
be able to pull them down.

Even if you could do that, I'm not sure what you would do once you got
them.

The contents of a web page are stored in the browser cache on your
computer; that's how you get to see it.
There are applications that enable you to download entire static
websites without actually visiting them. WebGrabber and SiteSucker come
to mind.

curl ships with OS X. Bring up a terminal window and do

man curl

to see what's available.

--phil
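
For the "thread split across pages" case, one thing man curl will show
is URL globbing: curl can expand a numeric range inside a URL. A rough
sketch, with an invented board URL (check what your board really puts
in the address bar):

  # Fetch pages 1-15 of a hypothetical thread; the "#1" in the output
  # file name is replaced by the number taken from the [1-15] range.
  curl -o "thread42_page_#1.html" \
       "http://www.example.com/board/viewthread.php?id=42&page=[1-15]"

The result is a set of plain HTML files, which burn to CD like any
other files; note that curl does not also fetch the images the pages
refer to.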
 

Phil Earnhardt

I very much dislike seeing someone use those applications on my site.
Downloading over a hundred pages and all that comes with them in a short
time, for what? I'm sure the one who does that is not going to read all
the stuff that just got downloaded, which means it is just a waste of
bandwidth.
If I spot the same IP doing that more than once (yup, there are those
people) or if I notice that it is a commercial enterprise that does that,
the IP gets banned. I wish there was a way to block these grabbers
altogether.

I can't imagine how you would categorically block them. OTOH, the
Robots Exclusion Protocol can be used to tell anyone who honors such
things that you don't want your website copied.

--phil
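
For what it's worth, a minimal sketch of such a robots.txt, placed at
the top level of the site. The agent name is only an example, and only
robots that choose to honor the protocol are affected:

  # http://www.example.com/robots.txt
  User-agent: SiteSucker
  Disallow: /

  User-agent: *
  Disallow: /forum/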
 

cmashieldscapting

Phil said:
curl ships with OS X. Bring up a terminal window and do

man curl

to see what's available.

--phil

Thanks, Phil, that's the sort of answer I was looking for. Already a
couple of times when I was trying to do things, I found I had things I
didn't know I had, or things I did know I had did things I didn't know
they did.

Cori
 

cmashieldscapting

Phil said:
I can't imagine how you would categorically block them. OTOH, the
Robots Exclusion Protocol can be used to tell anyone who honors such
things that you don't want your website copied.

--phil

Try doing a search using key topic words on Usenet, or at least Google
Groups, the Google version of Usenet. I'm pretty sure I saw a couple
of discussions regarding this.

Cori
 

Barbara de Zoete

[ like to block grabbers/downloaders and the like ]

Try doing a search using key topic words on Usenet, or at least Google
Groups, the Google version of Usenet. I'm pretty sure I saw a couple
of discussions regarding this.

I found this site <URL:http://www.psychedelix.com/agents/index.shtml> that
lists the 'handles' (don't know a better word for that right now) of known
bots and user agents et cetera. I singled out the ones marked with
D(ownload) and S(pam bot or other bad bot) and put them in my httpd.ini
file[1] to redirect them into nothing. This is all relatively new to me
though, so I'll have to see in my logs if the next Mget try succeeds, or
not :)


[1] I'm on an IIS server
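
For comparison, the same idea in Apache mod_rewrite syntax; the exact
ISAPI_Rewrite spelling in an IIS httpd.ini will differ, and the agent
names below are only examples of the kind listed on sites like the one
above:

  # .htaccess: refuse any request whose User-Agent matches a grabber
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} (SiteSucker|WebCopier|HTTrack) [NC]
  RewriteRule .* - [F]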
 

Greg N.

Barbara said:
I very much dislike seeing someone use those applications on my site.
Downloading over a hundred pages and all that comes with them in a
short time, for what?

I think there are reasonable applications for this type of thing. One
that I heard about is something like "off-line mobile internet" (I
forget the correct term).

Here is how it works. You define a list of pet URLs, either explicitly
or in wildcard fashion. You connect to the web overnight, and it sucks
in and buffers all the new and _changed_ pages on these sites. When you
leave home in the morning, your laptop contains the latest version of
everything you need to surf off-line.

Sounds like a reasonable app to me. I'd be flattered if somebody chose
my site to be among his mobile pet site list.
 

Phil Earnhardt

You'd better not try that on a wpoison web site! ;-)
http://www.monkeys.com/wpoison/

Go look at the "safety" page on that site.

wpoison uses the Robot Exclusion Protocol already discussed here; only
programs that ignore the robots.txt guidelines should wind up in an
infinite maze of twisty passages -- all different.

Now, it's a certainty that there are poisoned sites that don't honor
the REP; one certainly does have to be careful doing such things. And,
you're right: in general, it's a pretty pointless (and potentially
risky) operation to go around grabbing copies of websites.

--phil
 

Alan J. Flavell

[I've proposed f'ups to what seems the least off-topic group...]

Phil said:
Go look at the "safety" page on that site.

By all means. My own references to a wpoison server have done no harm
to the fact that my URLs seem well-indexed at the bona fide search
services.
wpoison uses the Robot Exclusion Protocol already discussed here;
only programs that ignore the robots.txt guidelines should wind up
in an infinite maze of twisty passages -- all different.

Right.

Now, it's a certainty that there are poisoned sites that don't honor
the REP;

And indeed we see them merrily trawling away, in the logs of that
wpoison server. Not just the initial wpoison page to which I'm
pointing, but then recursing their way through the "twisty passages"
to which you refer.

My links have been in place, unchanged, for well over a year (in fact,
I have to confess that I had forgotten all about them after a while,
and only recently remembered they were there), and the logs on the
wpoison server show that the address-trawlers haven't tired of the fun
yet.

cheers
 

Barbara de Zoete

I think there are reasonable applications for this type of thing.

There is always the exception.
Sounds like a reasonable app to me. I'd be flattered if somebody chose
my site to be among his mobile pet site list.

Just like it is kind of flattering to see your content get stolen (yes,
happened to me once with a page I was particularly fond of). Doesn't mean
I just let it be.
 

Barbara de Zoete

There is always the exception.

Which isn't that phone thingy you mentioned though. I hardly believe one
would download over a hundred pages to a mobile phone (or a notebook for
that matter).
 

Greg N.

Which isn't that phone thingy you mentioned though. I hardly believe one
would download over a hundred pages to a mobile phone (or a notebook
for that matter).

I did not mention "phone". I said "mobile", as in "mobile computer".

I think the thing I described is indeed used on laptops, for use on
the road. IIRC, it's mostly used with biz-related intranet stuff, but
it can be (and is) used with any other sites as well.
 

Jochem Huhmann

Barbara de Zoete said:
Which isn't that phone thingy you mentioned though. I hardly believe one
would download over a hundred pages to a mobile phone (or a notebook for
that matter).

Is archiving a valid purpose? As http://www.archive.org does?

I think when you're publishing anything on the WWW you have to accept
the fact that people or machines look at and download the stuff.


Jochem
 

Barbara de Zoete

Is archiving a valid purpose? As http://www.archive.org does?

I think when you're publishing anything on the WWW you have to accept
the fact that people or machines look at and download the stuff.

Looking at and downloading while browsing, that's fine. But hey, I pay
for the bandwidth, so I get to decide how it can be used. I don't allow
deep linking to images, and I don't want people to download a random
hundred pages just to look at a dozen of them or so and throw the rest
out. _Don't_waste_ my bandwidth, _use_ it.
 

Barry Margolin

Phil Earnhardt said:
I can't imagine how you would categorically block them. OTOH, the
Robots Exclusion Protocol can be used to tell anyone who honors such
things that you don't want your website copied.

I wouldn't expect a manual download application to honor it. That
mechanism is intended to control automated web crawlers, like the ones
that Google uses to index all of the web.
 

Phil Earnhardt

I wouldn't expect a manual download application to honor it. That
mechanism is intended to control automated web crawlers, like the ones
that Google uses to index all of the web.

wget respects the Robot Exclusion Protocol; curl does not.

--phil
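
In practice that means a recursive wget run fetches robots.txt first
and skips whatever it disallows, unless that check is switched off
explicitly (hypothetical URL again):

  # Default for recursive retrieval: robots.txt is honored
  wget --recursive --level=2 --wait=1 http://www.example.com/

  # Explicitly ignore robots.txt -- the behavior site owners above
  # object to
  wget -e robots=off --recursive --level=2 http://www.example.com/

curl has no such option because it never recurses or consults
robots.txt in the first place.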
 

Travis Newbury

Barbara said:
I very much dislike seeing someone use those applications on my site.
Downloading over a hundred pages and all that comes with them in a short
time, for what? I'm sure the one who does that is not going to read all
the stuff that just got downloaded, which means it is just a waste of
bandwidth....

Snore....
 
