Copying Website Contents, esp. Message Boards


cmashieldscapting

How would a Macintosh user best go about the following:

--Copying the entire contents of a message board composed of forums.
Each forum has threads that must be opened to see the messages, with
longer threads divided into pages.

--Copying messages from the Google version of Usenet, in which messages
are organized into threads, with the contents of a certain number of
messages visible per page.

--Copying messages from Yahoo! groups, which shows each individual
message in the order it came in. The messages appear on pages, and
each message has to be opened to be read.

What I especially need is a way to copy the entire contents of messages
in bulk without having to open each message--can one or more pages be
copied at a time in certain formats?--and to save them by backing them
up on CD, ideally in a form as close as possible to the way they appear
in their natural habitat. Of course I'd have to have a good idea of
what will fit on a CD, so a very large forum may need to be divided
across several CDs.

I am also interested in copying other website content, such as text,
pictures, sound, and moving images. Thanks for any help and advice.

Cori
 

ray

How would a Macintosh user best go about the following:

--Copying the entire contents of a message board composed of forums.
Each forum has threads that must be opened to see the messages, with
longer threads divided into pages.

As the pages you describe are generated as a result of queries to a
database, I'd say you can't. Without the query the page doesn't exist.

I am also interested in copying other website content, such as text,
pictures, sound, and moving images.

The contents of a web page are stored in the browser cache on your
computer; that's how you get to see it.
There are applications that enable you to download entire static
websites without actually visiting them. WebGrabber and SiteSucker come
to mind.
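
For illustration, the command-line equivalent of that kind of
whole-site grab, using wget. This is only a sketch: the URL is made
up, and wget may need to be installed separately on a Mac (e.g. via
Fink or MacPorts), unlike curl.

  # --recursive        follow links from the starting page
  # --level=3          but only three levels deep
  # --page-requisites  also fetch the images/CSS each page needs
  # --convert-links    rewrite links so the local copy works off-line
  # --wait=2           pause two seconds between requests, to be polite
  # --no-parent        never climb above the starting directory
  wget --recursive --level=3 --page-requisites --convert-links \
       --wait=2 --no-parent http://www.example.com/forum/

Whether this reaches the individual messages depends on the board: if
every thread and page is an ordinary link it can be followed, but
anything that only appears after a form post or a login cannot.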
 

Barbara de Zoete

There are applications that enable you to download entire static
websites without actually visiting them. WebGrabber and SiteSucker come
to mind.

I very much dislike seeing someone use those applications on my site.
Downloading over a hundred pages and all that comes with them in a short
time, for what? I'm sure the one who does that is not going to read all
the stuff that just got downloaded, which means it is just a waste of
bandwidth.
If I spot the same IP doing that more than once (yup, there are those
people) or if I notice that it is a commercial enterprise that does that,
the IP gets banned. I wish there was a way to block these grabbers
altogether.
 

Phil Earnhardt

As the pages you describe are generated as a result of queries to a
database, I'd say you can't. Without the query the page doesn't exist.

If the queries are wired into the HTML links of the pages you wish to
grab, the automated tools to recursively capture an entire website may
be able to pull them down.

Even if you could do that, I'm not sure what you would do once you got
them.

The contents of a web page are stored in the browser cache on your
computer; that's how you get to see it.
There are applications that enable you to download entire static
websites without actually visiting them. WebGrabber and SiteSucker come
to mind.

curl ships with OS X. Bring up a terminal window and do

man curl

to see what's available.

--phil
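
For the "thread split across pages" case, one thing man curl will show
is URL globbing: curl can expand a numeric range inside a URL. A rough
sketch, with an invented board URL (check what your board really puts
in the address bar):

  # Fetch pages 1-15 of a hypothetical thread; the "#1" in the output
  # file name is replaced by the number taken from the [1-15] range.
  curl -o "thread42_page_#1.html" \
       "http://www.example.com/board/viewthread.php?id=42&page=[1-15]"

The result is a set of plain HTML files, which burn to CD like any
other files; note that curl does not also fetch the images the pages
refer to.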
 

Phil Earnhardt

I very much dislike seeing someone use those applications on my site.
Downloading over a hundred pages and all that comes with them in a short
time, for what? I'm sure the one who does that is not going to read all
the stuff that just got downloaded, which means it is just a waste of
bandwidth.
If I spot the same IP doing that more than once (yup, there are those
people) or if I notice that it is a commercial enterprise that does that,
the IP gets banned. I wish there was a way to block these grabbers
altogether.

I can't imagine how you would categorically block them. OTOH, the
Robots Exclusion Protocol can be used to tell anyone who honors such
things that you don't want your website copied.

--phil
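
For what it's worth, a minimal sketch of such a robots.txt, placed at
the top level of the site. The agent name is only an example, and only
robots that choose to honor the protocol are affected:

  # http://www.example.com/robots.txt
  User-agent: SiteSucker
  Disallow: /

  User-agent: *
  Disallow: /forum/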
 

cmashieldscapting

Phil said:
curl ships with OS X. Bring up a terminal window and do

man curl

to see what's available.

--phil

Thanks, Phil, that's the sort of answer I was looking for. Already a
couple of times when I was trying to do things, I found I had things I
didn't know I had, or things I did know I had did things I didn't know
they did.

Cori
 

cmashieldscapting

Phil said:
I can't imagine how you would categorically block them. OTOH, the
Robots Exclusion Protocol can be used to tell anyone who honors such
things that you don't want your website copied.

--phil

Try doing a search using key topic words on Usenet, or at least Google
Groups, the Google version of Usenet. I'm pretty sure I saw a couple
of discussions regarding this.

Cori
 

Barbara de Zoete

[ like to block grabbers/downloaders and the like ]

Try doing a search using key topic words on Usenet, or at least Google
Groups, the Google version of Usenet. I'm pretty sure I saw a couple
of discussions regarding this.

I found this site <URL:http://www.psychedelix.com/agents/index.shtml> that
lists the 'handles' (don't know a better word for that right now) of known
bots and user agents et cetera. I singled out the ones marked with
D(ownload) and S(pam bot or other bad bot) and put them in my httpd.ini
file[1] to redirect them into nothing. This is all relatively new to me
though, so I'll have to see in my logs if the next Mget try succeeds, or
not :)


[1] I'm on an IIS server
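
For comparison, the same idea in Apache mod_rewrite syntax; the exact
ISAPI_Rewrite spelling in an IIS httpd.ini will differ, and the agent
names below are only examples of the kind listed on sites like the one
above:

  # .htaccess: refuse any request whose User-Agent matches a grabber
  RewriteEngine On
  RewriteCond %{HTTP_USER_AGENT} (SiteSucker|WebCopier|HTTrack) [NC]
  RewriteRule .* - [F]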
 

Greg N.

Barbara said:
I very much dislike seeing someone use those applications on my site.
Downloading over a hundred pages and all that comes with them in a
short time, for what?

I think there are reasonable applications for this type of thing. One
that I heard about is something like "off-line mobile internet" (I
forget the correct term).

Here is how it works. You define a list of pet URLs, either explicitly
or in wildcard fashion. You connect to the web overnight, and it sucks
in and buffers all the new and _changed_ pages on these sites. When you
leave home in the morning, your laptop contains the latest version of
everything you need to surf off-line.

Sounds like a reasonable app to me. I'd be flattered if somebody chose
my site to be among his mobile pet site list.
 

Phil Earnhardt

You'd better not try that on a wpoison web site! ;-)
http://www.monkeys.com/wpoison/

Go look at the "safety" page on that site.

wpoison uses the Robot Exclusion Protocol already discussed here; only
programs that ignore the robots.txt guidelines should wind up in an
infinite maze of twisty passages -- all different.

Now, it's a certainty that there are poisoned sites that don't honor
the REP; one certainly does have to be careful doing such things. And,
you're right: in general, it's a pretty pointless (and potentially
risky) operation to go around grabbing copies of websites.

--phil
 

Alan J. Flavell

[I've proposed f'ups to what seems the least off-topic group...]

Phil said:
Go look at the "safety" page on that site.

By all means. My own references to a wpoison server have done no harm
to the fact that my URLs seem well-indexed at the bona fide search
services.
wpoison uses the Robot Exclusion Protocol already discussed here;
only programs that ignore the robots.txt guidelines should wind up
in an infinite maze of twisty passages -- all different.

Right.

Now, it's a certainty that there are poisoned sites that don't honor
the REP;

And indeed we see them merrily trawling away, in the logs of that
wpoison server. Not just the initial wpoison page to which I'm
pointing, but then recursing their way through the "twisty passages"
to which you refer.

My links have been in place, unchanged, for well over a year (in fact,
I have to confess that I had forgotten all about them after a while,
and only recently remembered they were there), and the logs on the
wpoison server show that the address-trawlers haven't tired of the fun
yet.

cheers
 

Barbara de Zoete

I think there are reasonable applications for this type of thing.

There is always the exception.
Sounds like a reasonable app to me. I'd be flattered if somebody chose
my site to be among his mobile pet site list.

Just like it is kind of flattering to see your content get stolen (yes,
happened to me once with a page I was particularly fond of). Doesn't mean
I just let it be.
 

Barbara de Zoete

There is always the exception.

Which isn't that phone thingy you mentioned though. I hardly believe one
would download over a hundred pages to a mobile phone (or a notebook for
that matter).
 

Greg N.

Which isn't that phone thingy you mentioned though. I hardly believe one
would download over a hundred pages to a mobile phone (or a notebook
for that matter).

I did not mention "phone". I said "mobile", as in "mobile computer".

I think the thing I described is indeed used on laptops, for use on
the road. IIRC, it's mostly used with biz-related intranet stuff, but
it can be (and is) used with any other sites as well.
 

Jochem Huhmann

Barbara de Zoete said:
Which isn't that phone thingy you mentioned though. I hardly believe one
would download over a hundred pages to a mobile phone (or a notebook for
that matter).

Is archiving a valid purpose? As http://www.archive.org does?

I think when you're publishing anything on the WWW you have to accept
the fact that people or machines look at and download the stuff.


Jochem
 

Barbara de Zoete

Is archiving a valid purpose? As http://www.archive.org does?

I think when you're publishing anything on the WWW you have to accept
the fact that people or machines look at and download the stuff.

Looking at and downloading while browsing, that's fine. But hey, I pay
for the bandwidth, so I get to decide how it can be used. I don't allow
deep linking to images, and I don't want people to download a random
hundred pages just to look at a dozen of them or so and throw the rest
out. _Don't_waste_ my bandwidth, _use_ it.
 

Barry Margolin

Phil Earnhardt said:
I can't imagine how you would categorically block them. OTOH, the
Robots Exclusion Protocol can be used to tell anyone who honors such
things that you don't want your website copied.

I wouldn't expect a manual download application to honor it. That
mechanism is intended to control automated web crawlers, like the ones
that Google uses to index all of the web.
 

Phil Earnhardt

I wouldn't expect a manual download application to honor it. That
mechanism is intended to control automated web crawlers, like the ones
that Google uses to index all of the web.

wget respects the Robot Exclusion Protocol; curl does not.

--phil
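
In practice that means a recursive wget run fetches robots.txt first
and skips whatever it disallows, unless that check is switched off
explicitly (hypothetical URL again):

  # Default for recursive retrieval: robots.txt is honored
  wget --recursive --level=2 --wait=1 http://www.example.com/

  # Explicitly ignore robots.txt -- the behavior site owners above
  # object to
  wget -e robots=off --recursive --level=2 http://www.example.com/

curl has no such option because it never recurses or consults
robots.txt in the first place.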
 

Travis Newbury

Barbara said:
I very much dislike seeing someone use those applications on my site.
Downloading over a hundred pages and all that comes with them in a short
time, for what? I'm sure the one who does that is not going to read all
the stuff that just got downloaded, which means it is just a waste of
bandwidth....

Snore....
 
