ruby-doc.org content snarfing

James Britt

I received an alert E-mail today telling me that ruby-doc.org had
exceeded its allotted bandwidth. There could be all sorts of reasons for
this, and if it were simply due to popularity I'd be thrilled. But it
appears that someone has been running wget and snarfing the site wholesale.

This is a bad thing. I'm in the process of blocking IP addresses and
domain names. I've also turned off access to the Euruko 2003 videos
until next month.

I really don't think this abuse is coming from any regular reader of
this list, but on the off chance that I'm wrong: please stop it.

Thanks,

James Britt

jbritt AT ruby-doc DOT org
 
Aredridel

James said:
I received an alert E-mail today telling me that ruby-doc.org had
exceeded its allotted bandwidth. There could be all sorts of reasons for
this, and if it were simply due to popularity I'd be thrilled. But it
appears that someone has been running wget and snarfing the site wholesale.

This is a bad thing. I'm in the process of blocking IP addresses and
domain names. I've also turned off access to the Euruko 2003 videos
until next month.

I really don't think this abuse is coming from any regular reader of
this list, but on the off chance that I'm wrong: please stop it.

James -- is there a possibility of you providing a tarball of the whole
site for download?

I'd be more than willing to host a tracker and seed for a BitTorrent
copy of it.

Ari
 
James Britt

David said:
Not sure if this is the problem, but I noticed that
http://www.ruby-doc.org/robots.txt doesn't exist. Is it getting hit by
search bots?

Crawlers and spiders and such are fine. They hit the site often.

I've never seen this sort of downloading, though. And the legit spiders
(i.e., the ones likely to respect a robots.txt file) tend to identify
themselves as such. This one didn't.

That's not to say that a robots.txt file would be a bad thing, just
that spiders have never been a problem.


James
 
James Britt

Aredridel said:
James -- is there a possibility of you providing a tarball of the whole
site for download?

I'd be more than willing to host a tracker and seed for a BitTorrent
copy of it.

A tarball of the whole site would be close to 5 GB. That's including the
videos. Torrents for the videos might be a good idea; I'm not sure a
torrent for anything else is all that useful. There are assorted
bundles and stand-alone files that are easy to download as needed. Much
of the site's content consists of links to other places. There's the
HTML version of Programming Ruby, and the core and standard lib docs.

Each of these gets updated on a different schedule. A monolithic
tarball would be out of date fairly quickly.

In actual practice, the traffic has been fine. It's just this one time
somebody decided to go grab what appears to be *everything*, all in one day.

Thank you for the offer, though. When I first started hosting videos on
the site I was unfamiliar with BitTorrent. Now, though, it seems as if
it would be a good option for the larger files.


James
 
Aredridel

James said:
A tarball of the whole site would be close to 5 GB. That's including the
videos. Torrents for the videos might be a good idea; I'm not sure a
torrent for anything else is all that useful. There are assorted
bundles and stand-alone files that are easy to download as needed. Much
of the site's content consists of links to other places. There's the
HTML version of Programming Ruby, and the core and standard lib docs.

Yeah, torrents make sense for anything over 10 MB.

James said:
Each of these gets updated on a different schedule. A monolithic
tarball would be out of date fairly quickly.

Yeah -- a few large ones might appease those wanting a copy. I know I've
toyed with the idea. I'd love a copy of good chunks of the site for
offline reading.

James said:
In actual practice, the traffic has been fine. It's just this one time
somebody decided to go grab what appears to be *everything*, all in one day.

Ouch. That happened to me once...

James said:
Thank you for the offer, though. When I first started hosting videos on
the site I was unfamiliar with BitTorrent. Now, though, it seems as if
it would be a good option for the larger files.

Yeah.

Have a good one.

Ari
 
Andreas Schwarz

James said:
I received an alert E-mail today telling me that ruby-doc.org had
exceeded its allotted bandwidth. There could be all sorts of reasons for
this, and if it were simply due to popularity I'd be thrilled. But it
appears that someone has been running wget and snarfing the site wholesale.

You should look at mod_throttle and set up a few .htaccess rules against
annoying web spiders:

# Tag requests from known site-copying user agents with keep_out
SetEnvIf User-Agent MSIECrawler keep_out
SetEnvIf User-Agent ^Teleport keep_out
SetEnvIf User-Agent ^WebStripper keep_out
SetEnvIf User-Agent ^Offline keep_out
SetEnvIf User-Agent HTTrack keep_out
SetEnvIf User-Agent Xaldon keep_out
SetEnvIf User-Agent WebCopier keep_out

# Allow everyone except requests tagged keep_out
<Limit GET POST>
Order Allow,Deny
Allow from all
Deny from env=keep_out
</Limit>
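
The original report named wget, which the list above does not cover. A minimal sketch of how the same pattern might be extended, assuming the client sends wget's default User-Agent (which begins with "Wget/"). Anyone can override that with wget's --user-agent option, so this would only stop the casual case:

```apache
# Assumption: the client has not changed wget's default User-Agent string.
# SetEnvIfNoCase matches the pattern case-insensitively.
SetEnvIfNoCase User-Agent ^Wget keep_out
```

Because it reuses the keep_out variable, the existing Limit block would deny these requests without further changes.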
 
Ruby Script

James said:
I received an alert E-mail today telling me that ruby-doc.org had
exceeded its allotted bandwidth. There could be all sorts of reasons for
this, and if it were simply due to popularity I'd be thrilled. But it
appears that someone has been running wget and snarfing the site wholesale.

This is a bad thing. I'm in the process of blocking IP addresses and
domain names. I've also turned off access to the Euruko 2003 videos
until next month.

I really don't think this abuse is coming from any regular reader of
this list, but on the off chance that I'm wrong: please stop it.

Thanks,

James Britt

jbritt AT ruby-doc DOT org

I highly recommend you look into mod_dosevasive.

If there were a decent way to verify the authenticity of the source IP
addresses (i.e., not spoofed), then blocking would be a great first step.

The next step might be posting the verified abusive addresses online (so
the rest of us can take appropriate action like blocking them from our
sites) or submitting them to dshield.org. This might be annoying enough
for them to move on to other targets.
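
For the mod_dosevasive suggestion above, a rough sketch of what a server-level configuration might look like. The directive names are the ones I recall from the module's README; the thresholds are illustrative guesses rather than tuned values, and the module identifier in the IfModule line depends on the Apache version (mod_dosevasive.c for 1.3, mod_dosevasive20.c for 2.0), so check all of it against the README shipped with your copy:

```apache
# Assumption: Apache 2.0 build of the module; use mod_dosevasive.c on 1.3.
<IfModule mod_dosevasive20.c>
    # Size of the per-process hash table used to track clients
    DOSHashTableSize    3097
    # Block a client requesting the same page more than 5 times per second
    DOSPageCount        5
    DOSPageInterval     1
    # Block a client making more than 50 requests site-wide per second
    DOSSiteCount        50
    DOSSiteInterval     1
    # Seconds a blocked client keeps receiving 403 responses
    DOSBlockingPeriod   60
</IfModule>
```

This limits request rate only. A slow, polite mirror run that stays under the thresholds would still download everything.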
 
Mark Hubbart

I highly recommend you look into mod_dosevasive.

If there were a decent way to verify the authenticity of the source IP
addresses (i.e., not spoofed), then blocking would be a great first step.

The next step might be posting the verified abusive addresses online
(so the rest of us can take appropriate action like blocking them from
our sites) or submitting them to dshield.org. This might be annoying
enough for them to move on to other targets.

From what has been said, I doubt that the person who did this was being
malicious... Just ignorant. That's something that I might have done,
before I got experience as a webmaster and realized how rotten it can be
:) So it might be a little bit of overkill to share their IP addresses
for mass banning. Maybe just a good slap on the wrist, like redirecting
all their page requests to very_stern_warning.text
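
The redirect idea could be sketched with mod_rewrite in an .htaccess file, roughly like this. The address 192.0.2.10 is a documentation-range placeholder, not an address from this thread, and the sketch assumes mod_rewrite is enabled and that very_stern_warning.text sits in the site root:

```apache
RewriteEngine On
# Placeholder address: substitute the client you actually want to warn.
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.10$
# Skip the warning page itself so the redirect does not loop.
RewriteCond %{REQUEST_URI} !^/very_stern_warning\.text$
# Send every other request from that client to the warning page.
RewriteRule .* /very_stern_warning.text [R=302,L]
```

A temporary (302) redirect seems the safer choice here, since a permanent one may be cached by the client after the rule is removed.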

Of course, I might be wrong, and they might actually be *wanting* to
cause problems. In which case, they should be taken out and shot :D

cheers,
Mark
 
James Britt

Mark said:
From what has been said, I doubt that the person who did this was being
malicious... Just ignorant. That's something that I might have done,
before I got experience as a webmaster and realized how rotten it can be
:) So it might be a little bit of overkill to share their IP addresses
for mass banning. Maybe just a good slap on the wrist, like redirecting
all their page requests to very_stern_warning.text

I've banned specific IP addresses and domain names. If I get complaints
then perhaps I'll undo it. I don't really expect that.
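
For anyone wanting to do something similar, a ban by address and by domain might look roughly like this in Apache. This is a sketch, not the configuration actually used on ruby-doc.org, and the addresses and domain are reserved documentation placeholders:

```apache
Order Allow,Deny
Allow from all
# Placeholders only: 192.0.2.0/24 and example.com are reserved for documentation.
# Block a single address
Deny from 192.0.2.10
# Block a whole network range
Deny from 192.0.2.0/24
# Block by host name; this makes Apache do DNS lookups on each client
Deny from example.com
```

Blocking by address is cheaper than blocking by domain name, because the domain form requires reverse DNS lookups for incoming requests.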

As I don't know the identity of the people responsible, nor their
motives, I'm focusing on protecting the site. If I find reason to
believe someone is being malicious or willfully thoughtless, I'll
consider other action.

It may very well have been a site-grabber script gone bad. I don't
know. I'm mainly concerned with preventing it in the future, and I
appreciate the helpful comments I've received here.


Thanks,


James
 
