Download Calvin+Hobbes Script

Elliot Temple · Aug 18, 2007

The top link on del.icio.us is a site with all the Calvin + Hobbes
strips. I thought I'd download them before they get taken down. Here's
the code if you want, it's very short. Newer people can also see how
easy it is to use open-uri for simple web scraping.

Before running this consider buying the comics -- what is your
motivation to avoid paying for them? If it's bad, don't do it. (I own
them all in paper already and want an electronic version.) Also create
the c+h_archives folder or change the output path. FYI the images total
about 112 megs. There's 3691 of them.

Code below or here: http://pastie.caboo.se/88946

require "open-uri"

# Note: on Ruby 3.0 and later, open-uri no longer patches Kernel#open;
# call URI.open instead of open for the URLs below.
base_url = "http://www.marcellosendos.ch/comics/ch/"

# Follow every index link whose href starts with "1" (the archive pages).
open(base_url + "index.html") do |index|
  index.read.scan(/A href="(1.+?)">/).each do |archive_page_link|
    archive_page_link = base_url + archive_page_link[0]
    # Images sit in the same directory as the archive page that shows them.
    base_image_url = archive_page_link.gsub(/\/\w+\.\w+$/, "/")
    open(archive_page_link) do |archive_page|
      archive_page.read.scan(/src="(.+?\.gif)">/).each do |img|
        img_url = base_image_url + img[0]
        begin
          open(img_url) do |image_file|
            # "wb" keeps the GIF bytes intact on platforms that translate newlines
            File.open("c+h_archives/#{img[0]}", "wb") do |local_file|
              local_file.write(image_file.read)
            end
          end
        rescue StandardError => e
          # there are five broken image links
          puts "failed to get #{img_url}"
        end
      end
    end
  end
end
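
For newer people, here is a small offline walk-through of what the two scan calls and the gsub in the script do. The HTML snippets and file names (1985/198511.html, 19851118.gif) are made up for illustration; they are an assumption about the site's markup, not copied from it.

```ruby
base_url = "http://www.marcellosendos.ch/comics/ch/"

# Hypothetical markup, only here to exercise the patterns.
index_html = '<A href="1985/198511.html">Nov 1985</A> <A href="help.html">Help</A>'
page_html  = '<img src="19851118.gif">'

# scan returns one array of capture groups per match, hence the m[0].
# Only hrefs that start with "1" are picked up, so help.html is ignored.
archive_links = index_html.scan(/A href="(1.+?)">/).map { |m| base_url + m[0] }

# Strip the trailing "/name.ext" to get the directory that holds the images.
image_dirs = archive_links.map { |link| link.gsub(/\/\w+\.\w+$/, "/") }

# Pull the .gif file names out of the archive page.
image_names = page_html.scan(/src="(.+?\.gif)">/).map { |m| m[0] }

puts archive_links.inspect
puts image_dirs.inspect
puts image_names.inspect
```

Patterns like these are brittle: a change in attribute order or quoting on the site would make them match nothing.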

James Britt · Aug 18, 2007

Elliot said:
The top link on del.icio.us is a site with all the Calvin + Hobbes
strips. I thought I'd download them before they get taken down. Here's
the code if you want, it's very short. Newer people can also see how
easy it is to use open-uri for simple web scraping.

Before running this consider buying the comics

No, first consider the people hosting the content you're snarfing.

They're footing the bill for bandwidth and hosting.

... FYI the images total
about 112 megs. There's 3691 of them.

And not a single "sleep" in the script. Nice.

I see this sort of shit on ruby-doc.org, spiders ruthlessly fetching
every page in a site, one right after another.

It's rude, at the least.

Stupid, too.

James Britt

www.ruby-doc.org

Elliot Temple · Aug 18, 2007

James said:
No, first consider the people hosting the content you're snarfing.

They're footing the bill for bandwidth and hosting.

And not a single "sleep" in the script. Nice.

Hi James,

It's a good thing I posted. I will remember to put a sleep next time.
Thank you.

- Elliot

Elliot Temple · Aug 19, 2007

Elliot said:
Hi James,

It's a good thing I posted. I will remember to put a sleep next time.
Thank you.

Oh. How much sleep is best? One second per image would add an hour to
the script run time. I don't have a sense of how much is needed. 5
seconds? .5 seconds? Is requests per time or volume of data per time
more important to limit?

- Elliot
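
The arithmetic behind the "one second per image adds an hour" estimate can be checked with a few lines. This is only a sketch: the image count is the figure from the first post, and the delay values are the ones floated in the question, not recommendations.

```ruby
# Extra run time, in minutes, from sleeping delay_seconds after each request.
# 3691 is the image count given in the original post.
def added_minutes(delay_seconds, requests = 3691)
  (delay_seconds * requests) / 60.0
end

[0.5, 1, 5].each do |delay|
  puts format("%.1f s per image adds about %.0f minutes", delay, added_minutes(delay))
end
```

That works out to roughly 31, 62 and 308 minutes for 0.5, 1 and 5 seconds respectively.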

James Britt · Aug 19, 2007

Elliot said:
Oh. How much sleep is best?

60*60*24 might work.

One second per image would add an hour to
the script run time.

Gosh! Imagine having to wait a *whole hour* to glom someone else's
content!

I don't have a sense of how much is needed. 5
seconds? .5 seconds? Is requests per time or volume of data per time
more important to limit?

You're encouraging people to download 112 MB via 3691 requests from
someone else's Web site.

Right now, the only thing I see being limited is courtesy.

If you abuse a Web site you may have your IP address banned.

Sadly, most people running sites do not have the technical chops to
catch such behavior and cut people off before too much damage is done.

More likely, the target site will either go off-line for excessive
bandwidth, or the owner will get a surprise bill for overages.

There are often very good reasons to spider a site and grab content.
When needed, it must be done in a responsible way. Your example fails
that, both in motivation and technique.

--
James Britt

"Simplicity of the language is not what matters, but
simplicity of use."
- Richard A. O'Keefe in squeak-dev mailing list
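
As a rough illustration of the "responsible way" point, a download loop can pause between requests and skip files it already has, so a rerun does not hit the server again. Everything named here (polite_download, the fetch and pause parameters, the one-second default) is hypothetical and not from the thread; nobody in the discussion gives a concrete delay. It also assumes a Ruby recent enough to have URI.open.

```ruby
require "open-uri"

# Download each URL into dir, one at a time.
# fetch and pause are injectable so the loop can be tried without a network.
# The 1.0 second default delay is an arbitrary assumption.
def polite_download(urls, dir, delay: 1.0,
                    fetch: ->(url) { URI.open(url, &:read) },
                    pause: ->(seconds) { sleep(seconds) })
  failures = []
  urls.each do |url|
    path = File.join(dir, File.basename(url))
    next if File.exist?(path) # already fetched on an earlier run

    begin
      data = fetch.call(url) # fetch first so a failure leaves no empty file
      File.open(path, "wb") { |f| f.write(data) }
    rescue StandardError => e
      failures << url
      puts "failed to get #{url}: #{e.message}"
    end
    pause.call(delay) # wait after every request, successful or not
  end
  failures
end
```

Called as polite_download(image_urls, "c+h_archives", delay: 5), it returns the list of URLs that failed so they can be retried later.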

John N Joyner · Aug 19, 2007

You're encouraging people to download 112 MB via 3691 requests from
someone else's Web site.
Right now, the only thing I see being limited is courtesy.

I'm no expert, but it seems to me that Mr. Britt makes a
reasonable point. I'd be interested to know whether
Mr. Temple's comment about 5 seconds/.5 seconds was meant
simply as a genuinely "open" question, or whether it was
intended as a comment of some kind.
- JJ

John Joyce · Aug 19, 2007

A. They're probably hosting Calvin & Hobbes strips illegally, so they
get what they get. But in general, if you publish or make public
something, even if you hold an open house in your home, you deal with
the traffic or quit.

B. Buy the books. They're cheap in used bookstores! It's a heck of a
lot less work than writing a script. That said, how many times can
you or will you possibly read them? How much is your time worth to you?

Phlip · Aug 19, 2007

B. Buy the books. They're cheap in used bookstores! It's a heck of a lot
less work than writing a script. That said, how many times can you or
will you possibly read them? How much is your time worth to you?

The ultimate punchline: Calvin has just destroyed Susie D's snowman, and
he's sprawled face-down in the snow. Susie, holding the snowman's head over
him, says, "Calvin, look up!"

Ben Bleything · Aug 19, 2007

A. They're probably hosting Calvin & Hobbes strips illegally, so they
get what they get. But in general, if you publish or make public
something, even if you hold an open house in your home, you deal with
the traffic or quit.

They certainly are, and I know from past experience (supporting a site
that syndicated C&H) that the copyright holders are very protective of
their content. Best to just not mess with it.

B. Buy the books. They're cheap in used bookstores! It's a heck of a
lot less work than writing a script. That said, how many times can
you or will you possibly read them? How much is your time worth to you?

Hear hear.

Ben

Jaime Iniesta · Aug 19, 2007

Elliot said:
The top link on del.icio.us is a site with all the Calvin + Hobbes
strips. I thought I'd download them before they get taken down. Here's
the code if you want, it's very short. Newer people can also see how
easy it is to use open-uri for simple web scraping.

Hi Elliot, thanks for the script. Setting aside the ethics of using
it, it is certainly an interesting script. I thought about doing a
simpler version using Hpricot or scRUBYt, but right now your script
always prints "failed to get...".

Is it me or maybe they have taken measures to avoid direct downloading?
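
The rescue clause in the script prints the same message whatever went wrong, so the output alone cannot tell a blocked request from a local problem. A missing c+h_archives folder, for instance, would also raise inside the begin block and produce "failed to get" for every image. A minimal sketch of a more informative message follows; describe_failure is a hypothetical helper, not part of the original script.

```ruby
require "open-uri"

# Turn an exception from a failed download into a one-line explanation.
def describe_failure(url, error)
  if error.is_a?(OpenURI::HTTPError)
    code, reason = error.io.status # status pair such as ["404", "Not Found"]
    "failed to get #{url}: HTTP #{code} #{reason}"
  else
    # Covers local errors too, such as Errno::ENOENT for a missing folder.
    "failed to get #{url}: #{error.class}: #{error.message}"
  end
end
```

Inside the script's rescue clause this would be used as puts describe_failure(img_url, e).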

