ruby-doc.org content snarfing

Discussion in 'Ruby' started by James Britt, Aug 20, 2004.

  1. James Britt

    James Britt Guest

    I received an alert E-mail today telling me that ruby-doc.org had
    exceeded its allotted bandwidth. There could be all sorts of reasons for
    this, and if it were simply due to popularity I'd be thrilled. But it
    appears that someone has been running wget and snarfing the site wholesale.

    This is a bad thing. I'm in the process of blocking IP addresses and
    domain names. I've also turned off access to the EuRuKo 2003 videos
    until next month.
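    For anyone curious what that blocking looks like at the server level,
    a minimal .htaccess sketch would be something like the following (the
    address and domain are placeholders, not the actual offenders):

    ```apache
    # Hypothetical example only -- placeholder address and domain,
    # not the real offenders.
    order allow,deny
    allow from all
    deny from 192.0.2.7
    deny from .example-offender.net
    ```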

    I really don't think this abuse is coming from any regular reader of
    this list, but on the off chance that I'm wrong: please stop it.

    Thanks,

    James Britt

    jbritt AT ruby-doc DOT org
     
    James Britt, Aug 20, 2004
    #1

  2. Aredridel

    Aredridel Guest

    James -- is there a possibility of you providing a tarball of the whole
    site for download?

    I'd be more than willing to host a tracker and seed for a Bit Torrent
    copy of it.

    Ari
     
    Aredridel, Aug 21, 2004
    #2

  3. David Morton

    David Morton Guest

    Not sure if this is the problem, but I noticed that
    http://www.ruby-doc.org/robots.txt doesn't exist. Is it getting hit by
    search bots?
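    For reference, a minimal robots.txt that well-behaved crawlers honor
    might look like this (the path is hypothetical):

    ```
    User-agent: *
    Disallow: /videos/
    ```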
     
    David Morton, Aug 21, 2004
    #3
  4. James Britt

    James Britt Guest

    Crawlers and spiders and such are fine. They hit the site often.

    I've never seen this sort of downloading, though. And the legit spiders
    (i.e., the ones likely to respect a robots.txt file) tend to identify
    themselves as such. This one didn't.

    That's not to say that a robots.txt file would be a bad thing, just
    that spiders have never been the problem.


    James
     
    James Britt, Aug 21, 2004
    #4
  5. James Britt

    James Britt Guest

    A tarball of the whole site would be close to 5 GB. That's including the
    videos. Torrents for the videos might be a good idea; I'm not sure a
    torrent for anything else is all that useful. There are assorted
    bundles and stand-alone files that are easy to download as needed. Much
    of the site's content consists of links to other places. There's the
    HTML version of Programming Ruby, and the core and standard lib docs.

    Each of these gets updated on a different schedule. A monolithic
    tarball would be out of date fairly quickly.

    In actual practice, the traffic has been fine. It's just this one time
    somebody decided to go grab what appears to be *everything*, all in one day.

    Thank you for the offer, though. When I first started hosting videos on
    the site I was unfamiliar with BitTorrent. Now, though, it seems as if
    it would be a good option for the larger files.


    James
     
    James Britt, Aug 21, 2004
    #5
  6. Aredridel

    Aredridel Guest

    Yeah, torrents make sense for anything over 10 MB.
    Yeah -- a few large ones might appease those wanting a copy. I know I've
    toyed with the idea. I'd love a copy of good chunks of the site for
    offline reading.
    Ouch. That happened to me once...
    Yeah.

    Have a good one.

    Ari
     
    Aredridel, Aug 21, 2004
    #6
  7. Andreas Schwarz

    Andreas Schwarz Guest

    You should look at mod_throttle and set up a few .htaccess rules against
    annoying web spiders:

    SetEnvIf user-agent MSIECrawler keep_out
    SetEnvIf user-agent ^Teleport keep_out
    SetEnvIf user-agent ^WebStripper keep_out
    SetEnvIf user-agent ^Offline keep_out
    SetEnvIf user-agent HTTrack keep_out
    SetEnvIf user-agent Xaldon keep_out
    SetEnvIf user-agent WebCopier keep_out

    <Limit GET POST>
    order allow,deny
    allow from all
    deny from env=keep_out
    </Limit>
     
    Andreas Schwarz, Aug 21, 2004
    #7
  8. Ruby Script

    Ruby Script Guest

    I highly recommend you look into mod_dosevasive.

    If there were a decent way to verify the authenticity of the source IP
    addresses (i.e., not spoofed), then blocking would be a great first step.

    The next step might be posting the verified abusive addresses online (so
    the rest of us can take appropriate action like blocking them from our
    sites) or submitting them to dshield.org. This might be annoying enough
    for them to move on to other targets.
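    One way to find candidate addresses in the first place is to tally
    requests per IP from the access log. A rough Ruby sketch, with invented
    log lines and an arbitrary threshold, purely for illustration:

    ```ruby
    # Tally requests per client IP from Apache common-log-format lines
    # to spot wholesale downloaders. Sample data and threshold are made up.

    def heavy_hitters(log_lines, threshold)
      counts = Hash.new(0)
      log_lines.each do |line|
        ip = line.split(' ').first   # first field is the client IP
        counts[ip] += 1
      end
      counts.select { |_ip, n| n >= threshold }
    end

    sample = [
      '10.0.0.1 - - [20/Aug/2004:12:00:01 -0500] "GET /core/index.html HTTP/1.0" 200 512',
      '10.0.0.1 - - [20/Aug/2004:12:00:02 -0500] "GET /core/String.html HTTP/1.0" 200 4096',
      '10.0.0.1 - - [20/Aug/2004:12:00:03 -0500] "GET /core/Array.html HTTP/1.0" 200 2048',
      '192.0.2.7 - - [20/Aug/2004:12:00:04 -0500] "GET /index.html HTTP/1.0" 200 1024'
    ]

    puts heavy_hitters(sample, 3).keys   # prints "10.0.0.1"
    ```

    In practice you would read the real log with File.foreach and pick a
    threshold suited to your traffic.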
     
    Ruby Script, Aug 22, 2004
    #8
  9. Mark Hubbart

    Mark Hubbart Guest

    From what has been said, I doubt that the person who did this was being
    malicious... just ignorant. That's something that I might have done,
    before I got experience as a webmaster and realized how rotten it can be
    :) So it might be a little bit of overkill to share their IP addresses
    for mass banning. Maybe just a good slap on the wrist, like redirecting
    all their page requests to very_stern_warning.text.
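    A slap-on-the-wrist redirect like that could be sketched with
    mod_rewrite (the address and filename here are hypothetical):

    ```apache
    RewriteEngine On
    # Hypothetical offender address
    RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.7$
    # Send every request except the warning page itself to the warning page
    RewriteRule !^very_stern_warning\.text$ /very_stern_warning.text [R,L]
    ```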

    Of course, I might be wrong, and they might actually be *wanting* to
    cause problems. In which case, they should be taken out and shot :D

    cheers,
    Mark
     
    Mark Hubbart, Aug 22, 2004
    #9
  10. James Britt

    James Britt Guest

    I've banned specific IP addresses and domain names. If I get complaints
    then perhaps I'll undo it. I don't really expect that.

    As I don't know the identity of the people responsible, nor their
    motives, I'm focusing on protecting the site. If I find reason to
    believe someone is being malicious or willfully thoughtless, I'll
    consider other action.

    It may very well have been a site-grabber script gone bad. I don't
    know. I'm mainly concerned with preventing it in the future, and I
    appreciate the helpful comments I've received here.


    Thanks,


    James
     
    James Britt, Aug 22, 2004
    #10
