Elliot Temple
The top link on del.icio.us is a site with all the Calvin + Hobbes
strips. I thought I'd download them before they get taken down. Here's
the code if you want it; it's very short. Newer programmers can also see
how easy it is to use open-uri for simple web scraping.
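If you haven't used open-uri before, the whole pattern is: open a URL,
read it, scan the HTML with a regex. Here's a minimal sketch (the URL and
pattern are placeholders; on Ruby 3 and later you'd write URI.open
instead of a bare open):

require "open-uri"

# open-uri lets open() fetch URLs the same way it opens local files
open("http://example.com/") do |page|
  html = page.read
  # collect every href target on the page
  links = html.scan(/href="(.+?)"/).flatten
  puts links
end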
Before running this, consider buying the comics -- what is your
motivation to avoid paying for them? If it's bad, don't do it. (I own
them all on paper already and want an electronic version.) Also, create
the c+h_archives folder first or change the output path. FYI, the images
total about 112 megs; there are 3,691 of them.
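If you'd rather not make the folder by hand, two lines at the top of the
script will do it (an optional addition, not part of the original
script):

require "fileutils"

# create the output folder if it doesn't already exist
FileUtils.mkdir_p("c+h_archives")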
Code below or here: http://pastie.caboo.se/88946
require "open-uri"
base_url = "http://www.marcellosendos.ch/comics/ch/"
open("http://www.marcellosendos.ch/comics/ch/index.html") do |index|
index.read.scan(/A href="(1.+?)"\>/).each do |archive_page_link|
archive_page_link = base_url + archive_page_link[0]
base_image_url = archive_page_link.gsub(/\/\w+\.\w+$/, "/")
open(archive_page_link) do |archive_page|
archive_page.read.scan(/src="(.+?\.gif)"\>/).each do |img|
img_url = base_image_url + img[0]
begin
open(img_url) do |image_file|
File.open("c+h_archives/#{img[0]}", "w") do |local_file|
local_file.write(image_file.read)
end
end
rescue Exception => e
# there's five broken image links
puts "failed to get #{img_url}"
end
end
end
end
end
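Afterwards you can sanity-check the run by counting the downloaded
files; the total should come out to 3,691 minus the handful of broken
links. For example:

puts Dir.glob("c+h_archives/*.gif").size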