Download Calvin+Hobbes Script

Elliot Temple

The top link on del.icio.us is a site with all the Calvin + Hobbes
strips. I thought I'd download them before they get taken down. Here's
the code if you want, it's very short. Newer people can also see how
easy it is to use open-uri for simple web scraping.

Before running this consider buying the comics -- what is your
motivation to avoid paying for them? If it's bad, don't do it. (I own
them all in paper already and want an electronic version.) Also create
the c+h_archives folder or change the output path. FYI the images total
about 112 megs. There's 3691 of them.

Code below or here: http://pastie.caboo.se/88946

require "open-uri"

base_url = "http://www.marcellosendos.ch/comics/ch/"

# URI.open is required on Ruby 3.0+, where Kernel#open no longer accepts URLs.
URI.open(base_url + "index.html") do |index|
  # Each archive page is linked from the index as <A href="1...">.
  index.read.scan(/A href="(1.+?)"\>/).each do |archive_page_link|
    archive_page_link = base_url + archive_page_link[0]
    # Strip the trailing file name to get the directory the images live in.
    base_image_url = archive_page_link.gsub(/\/\w+\.\w+$/, "/")
    URI.open(archive_page_link) do |archive_page|
      archive_page.read.scan(/src="(.+?\.gif)"\>/).each do |img|
        img_url = base_image_url + img[0]
        begin
          URI.open(img_url) do |image_file|
            # "wb" keeps the GIF data intact on platforms that translate newlines.
            File.open("c+h_archives/#{img[0]}", "wb") do |local_file|
              local_file.write(image_file.read)
            end
          end
        rescue StandardError => e
          # there are five broken image links
          puts "failed to get #{img_url}: #{e.message}"
        end
      end
    end
  end
end
 
James Britt

Elliot said:
The top link on del.icio.us is a site with all the Calvin + Hobbes
strips. I thought I'd download them before they get taken down. Here's
the code if you want, it's very short. Newer people can also see how
easy it is to use open-uri for simple web scraping.

Before running this consider buying the comics

No, first consider the people hosting the content you're snarfing.

They're footing the bill for bandwidth and hosting.

Elliot said:
... FYI the images total
about 112 megs. There's 3691 of them.

And not a single "sleep" in the script. Nice.

I see this sort of shit on ruby-doc.org: spiders ruthlessly fetching
every page in a site, one right after another.

It's rude, at the least.

Stupid, too.



James Britt

www.ruby-doc.org
 
Elliot Temple

James said:
No, first consider the people hosting the content you're snarfing.

They're footing the bill for bandwidth and hosting.


And not a single "sleep" in the script. Nice.

Hi James,

It's a good thing I posted. I will remember to put a sleep next time.
Thank you.

- Elliot
 
Elliot Temple

Elliot said:
Hi James,

It's a good thing I posted. I will remember to put a sleep next time.
Thank you.

Oh. How much sleep is best? One second per image would add an hour to
the script run time. I don't have a sense of how much is needed. 5
seconds? .5 seconds? Is requests per time or volume of data per time
more important to limit?
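
The arithmetic behind the question can be sketched in a few lines. This is only an illustration using the figures quoted in the thread (3691 images, about 112 megs); the 1024 KB per MB conversion and the helper name added_minutes are assumptions. One second per image works out to roughly 61.5 extra minutes, and because each image averages about 31 KB, the per-request delay also caps the average data rate.

```ruby
# Figures from the thread: 3691 images, about 112 MB in total.
IMAGE_COUNT = 3691
TOTAL_MEGABYTES = 112.0

# Extra run time, in minutes, added by sleeping `delay` seconds per image.
# (Hypothetical helper, not part of the original script.)
def added_minutes(delay, count = IMAGE_COUNT)
  delay * count / 60.0
end

[0.5, 1, 5].each do |delay|
  # Sleeping alone takes delay * IMAGE_COUNT seconds, so the average
  # transfer rate cannot exceed total size divided by that time.
  max_rate_kb = TOTAL_MEGABYTES * 1024 / (delay * IMAGE_COUNT)
  puts format("%.1f s/image: +%.0f min, at most %.1f KB/s",
              delay, added_minutes(delay), max_rate_kb)
end
```

With files this small, limiting requests per second limits bandwidth as a side effect, so a single fixed delay addresses both concerns.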

- Elliot
 
James Britt

Elliot said:
Oh. How much sleep is best?

60*60*24 might work.

Elliot said:
One second per image would add an hour to
the script run time.

Gosh! Imagine having to wait a *whole hour* to glom someone else's
content!

Elliot said:
I don't have a sense of how much is needed. 5
seconds? .5 seconds? Is requests per time or volume of data per time
more important to limit?

You're encouraging people to download 112 MB via 3691 requests from
someone else's Web site.

Right now, the only thing I see being limited is courtesy.

If you abuse a Web site you may have your IP address banned.

Sadly, most people running sites do not have the technical chops to
catch such behavior and cut people off before too much damage is done.

More likely, the target site will either go off-line for excessive
bandwidth, or the owner will get a surprise bill for overages.

There are often very good reasons to spider a site and grab content.
When needed, it must be done in a responsible way. Your example fails
that, both in motivation and technique.


--
James Britt

"Simplicity of the language is not what matters, but
simplicity of use."
- Richard A. O'Keefe in squeak-dev mailing list
 
John N Joyner

James said:
You're encouraging people to download 112 MB via 3691 requests from
someone else's Web site.
Right now, the only thing I see being limited is courtesy.

I'm no expert, but it seems to me that Mr. Britt makes a
reasonable point. I'd be interested to know whether
Mr. Temple's comment about 5 seconds/.5 seconds was meant
simply as a genuinely "open" question, or whether it was
intended as a comment of some kind.
- JJ
 
John Joyce

A. They're probably hosting Calvin & Hobbes strips illegally, so they
get what they get. But in general, if you publish something or make it
public, even if you just hold an open house in your home, you deal with
the traffic or quit.

B. Buy the books. They're cheap in used bookstores! It's a heck of a
lot less work than writing a script. That said, how many times can
you or will you possibly read them? How much is your time worth to you?
 
Phlip

John Joyce said:
B. Buy the books. They're cheap in used bookstores! It's a heck of a lot
less work than writing a script. That said, how many times can you or
will you possibly read them? How much is your time worth to you?

The ultimate punchline: Calvin has just destroyed Susie D's snowman, and
he's sprawled face-down in the snow. Susie, holding the snowman's head over
him, says, "Calvin, look up!"
 
Ben Bleything

John Joyce said:
A. They're probably hosting Calvin & Hobbes strips illegally, so they
get what they get. But in general, if you publish something or make it
public, even if you just hold an open house in your home, you deal with
the traffic or quit.

They certainly are, and I know from past experience (supporting a site
that syndicated C&H) that the copyright holders are very protective of
their content. Best to just not mess with it.

John Joyce said:
B. Buy the books. They're cheap in used bookstores! It's a heck of a
lot less work than writing a script. That said, how many times can
you or will you possibly read them? How much is your time worth to you?

Hear hear.

Ben
 
Jaime Iniesta

On 2007/8/18, Elliot said:
The top link on del.icio.us is a site with all the Calvin + Hobbes
strips. I thought I'd download them before they get taken down. Here's
the code if you want, it's very short. Newer people can also see how
easy it is to use open-uri for simple web scraping.

Hi Elliot, thanks for the script. Setting aside the ethics of using
it, it is certainly an interesting script. I thought about writing a
simpler version using Hpricot or scRUBYt, but right now your script
always says "failed to get...".

Is it just me, or have they taken measures to prevent direct downloading?
 
