Hello all.
I've been helping the RubyGarden folks deal with some of the wiki spam
issues and thought I would provide an update here just so folks are
informed. I'll try to answer some of the questions that come up and let
you know what we have done.
But first the good news! A spam attack that targeted nearly 90 separate
pages was thwarted this morning with no need to manually revert any pages
(well, except for one).
Ok, now for a couple of responses ...
From "Charles Comstock said:
But seriously, it would seem the capitalization hack was only semi
successful.
I believe the capitalization trick weeded out most of the non-hard-core
spammers. The ones that were left were VERY determined to get in. We
have added additional restrictions beyond the HTTP hack that are triggered
by typical spammer posting patterns, and you could watch several of the
spammers try different variations on their posts until one got past the
filters. The fellow posting links to pack001.com was particularly
persistent.
It seems to me that most of the bots seem to put in links
in just a long series. Perhaps we could change it so you're limited to,
say, 5 outbound links per page edit?
Three things ...
(1) I really doubt they are bots for the most part ... at least not the
ones that are left. I watched one fellow this morning make a post, and
then repost to correct an error in the first attempt. Also, the timing of
the postings seems very non-machine-like. The repeated attempts to find a
way around the filters also indicate a human behind the wheel.
That's not to say there are no bots out there, but a significant fraction
are human. My theory is that all the recent "no-call" lists put a lot of
phone sales people out of work, and the only job they could find where
they could be equally annoying is working as wiki spammers.
(2) Not all spam is long chains of links. The first (and only successful)
spam posting this morning changed an existing link on the MooreheadGroup
page from www.lughead.org to www.sister8.com.
(3) Detecting the addition of links is difficult because the change is
submitted to the wiki as a whole page, with nothing to distinguish the new
material from the old. To determine what was added, we would have to run
a diff algorithm between the old and new content. Although doable, I'm
looking for easy-to-implement features right now because I'm dealing with
Perl code in UseMod (shudder!). (And yes, we are aggressively pursuing a
switch to a Ruby-based wiki engine ... I can't take too much more Perl!)
Or we could just scan for invisible
spans which seem to be a favorite of the bots at the moment?
Now that is a good idea. I only became aware of that technique this
morning, but it's on my list of things to try.
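For what it's worth, the check itself should be cheap. Something along
these lines (Ruby again, and the pattern is only a first guess at what
the spammers are actually sending):

  # Flag an edit that contains a span styled to be invisible to readers,
  # which so far only the spammers seem to have any use for.
  HIDDEN_SPAN = /<span[^>]*style\s*=\s*["'][^"']*(?:display\s*:\s*none|visibility\s*:\s*hidden)/i

  def hidden_span_spam?(text)
    text.match?(HIDDEN_SPAN)
  end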
Asfand Yar Qazi asks:
Why isn't the Wiki password-protected so that only authorised users can
edit it?
Part of the magic of the "wiki way" is gathering input from legitimate
users. Therefore we want to make the barrier to entry very, very low;
otherwise people just won't bother to post. It is entirely possible that
we may be forced to go to a user registration process, but I hope that's a
last resort and not the first response.
How about generating a JPG of a password and requiring the editor
to enter it? This would block the bots, I think.
This is known as a Captcha test (http://en.wikipedia.org/wiki/Captcha).
As mentioned earlier, I think most of the simple-minded bots have been
eliminated, and the remaining spammers are either human or bots closely
monitored by humans. I suspect that a captcha will have less of an effect
than we would hope. Of course, I could be wrong, and I may set up a test
of this. Also, as someone else noted, captcha systems raise some concern
over accessibility issues.
Belorion said:
I've seen a couple of wikis that have honeypots embedded in them.
Basically, when you click on the honeypot link you are taken to a page
which says basically "do not click *this* link, it will disable your
access to this wiki for X time". So, when a bot crawls the whole
page, it also crawls the honeypot, and the IP gets logged and
banned from the site.
I first heard of something like this suggested by Patrick May (NARF
developer) at this year's RubyConf. Patrick called it a tarpit.
Essentially, spammers are routed to a shadow wiki that looks just like the
real one. Any changes they make to the pages only exist on the shadow
wiki and are not reflected on the real thing.
The beauty of this approach is that the spammers have no idea that they
have been redirected to a fake site. The problem with the current
banning system is that spammers know immediately that they have failed, so
they begin to investigate workarounds (e.g. switching IP addresses,
modifying the post content). If they think their spam is successful, then
they have no motivation to try harder.
Another feature of tarpits is the deliberate slowing of responses. If the
spammer has to wait forever for a page to update, it encourages them to
look elsewhere. I find a certain amount of self-righteous glee in the
thought of annoying spammers.
The downside to tarpits is twofold. First, we still have the problem of
identifying spammers. A banlist (realtime or manual) is one possibility.
Another is self-identifying behavior; e.g., using the invisible link trick
is a strong indication that you are a spammer.
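To make that concrete, the routing decision itself could be quite small.
A rough sketch in the same Ruby vein as the snippets above (the two
save_* calls are placeholders for whatever page storage the real engine
gives us, and TARPIT_IPS stands in for whatever banlist we settle on):

  TARPIT_IPS   = []   # placeholder: would be loaded from the banlist
  TARPIT_DELAY = 20   # seconds to keep a tarpitted spammer waiting

  def handle_edit(ip, page, old_text, new_text)
    spammy = TARPIT_IPS.include?(ip) ||
             hidden_span_spam?(new_text) ||
             too_many_new_links?(old_text, new_text)
    if spammy
      save_to_shadow_wiki(page, new_text)  # never reaches the real wiki
      sleep TARPIT_DELAY                   # the deliberate slowdown
    else
      save_to_real_wiki(page, new_text)
    end
  end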
The other downside is the possibility of legitimate users getting caught
in the tarpit. Since the tarpit /looks/ legitimate, I can easily imagine
legitimate users caught there without ever realizing it. If you
find that the wiki suddenly starts losing day-old posts of yours, drop
the wikimaster a note and ask if you have been tarpitted. I'm planning
some tarpit management software where we can review the tarpitted users
and fix any accidents. All in good time.
I have actually implemented a prototype tarpit on the current wiki and am
monitoring it to see how effective it is. We caught one spammer this
morning immediately after his first post, so his remaining 90-odd changes
went directly into the tarpit.
A *very* satisfying morning!