[QUIZ] Banned Words (#9)

R

Ruby Quiz

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.grayproductions.net/ruby_quiz/

3. Enjoy!

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Fredrik Jagenheim

At work we discovered that they installed a spam filter that throws away e-mail
that it considers spam. It doesn't use any Bayes Filter, it simply checks for
certain words that it considers 'banned'. One word we discovered was 'sex',
which is a Swedish word for the number six. So the phrase "I'll be home at six
o'clock." will be classified as spam, thrown away and never seen.

The Ruby Quiz I propose is to figure out which words are banned. Since the
filter is a black box, we can only find out which words they are by sending
email through it. The real problem is to find out how to do it with as *few*
emails as possible.

Of course I don't want the ruby community to do a Denial-of-Service on my
employer's mailserver, so do it as a local filter; perhaps something like this:

# by JEG2

#
# A simple class for managing a filter that prevents to use
# of a given _banned_words_ list.
#
class LanguageFilter
#
# Create a new LanguageFilter object that will
# disallow _banned_words_.
# Accepts a list of words, arrays of words,
# or a combination of the two.
#
def initialize( *banned_words )
@banned_words = banned_words.flatten.sort
@clean_calls = 0
end

# A count of the calls to <i>clean?</i>.
attr_reader :clean_calls

#
# Test if provided _text_ is allowable by this filter.
# Returns *false* if _text_ contains _banned_words_,
# *true* if it does not.
#
def clean?( text )
@clean_calls += 1
@banned_words.each do |word|
return false if text =~ /\b#{word}\b/
end
true
end

#
# Verify a _suspect_words_ list against the actual
# _banned_words_ list.
# Returns *false* if the two lists are not identical or
# *true* if the lists do match.
# Accepts a list of words, arrays of words,
# or a combination of the two.
#
def verify( *suspect_words )
suspect_words.flatten.sort == @banned_words
end
end

filter = LanguageFilter.new "six"

filter.clean?("I'll be home at six.") # => false
filter.clean?("Do not taunt the happy fun ball!") # => true

filter.verify("ball") # => false
filter.verify("six") # => true

filter.clean_calls # => 2

So, figure out how to find the hidden words, using as few calls to
LanguageFilter#clean? as possible.

Which algorithms do you find are effective when many words are blocked (like
10%) and which are effective when very few are blocked(1 in 20000)?

All solutions should do better than this: ;)

dict = ["foo", "bar", "six", "baz"]
filter = LanguageFilter.new "six"

p dict - (dict.dup.delete_if { |word| not filter.clean?(word) })
 
J

James Edward Gray II

So, figure out how to find the hidden words, using as few calls to
LanguageFilter#clean? as possible.

A teaser. Or possibly, a goal to shoot for.

3930 words, 29 banned words found
Correct? true
Time taken: 0.16201 seconds
Calls: 427
...

James Edward Gray II
 
B

Brian Schröder

Of course I don't want the ruby community to do a Denial-of-Service on my
employer's mailserver, so do it as a local filter; perhaps something like this:
[snip]

Hello Group,

I propose using

def clean?( text )
@clean_calls += 1
@banned_words.each do |word|
return false if text.include?(word)
end
true
end

in the test filter, because my solution does not work with swedish characters. Trying it with wswedish gives errors when using the regex and no errors when using include?, so I think there must be something weird going on in the regex.

James, be shure to include this kind of words in your test-cases, or specify that it is not needed. I do not want to delve deeper into this.

Regards,

Brian


PS:
Test 1:
Testing with 121430 words and 3 banned words
Took 0.564248085021973 seconds
Used 567 calls
Verified as correct

Test 2
Testing with 121430 words and 126 banned words
Took 6.52887606620789 seconds
Used 5545 calls
Verified as correct

Test 3
Testing with 3930 words and 28 banned words
Took 0.0872080326080322 seconds
Used 497 calls
Verified as correct
 
J

James Edward Gray II

Hello Group,

I propose using

def clean?( text )
@clean_calls += 1
@banned_words.each do |word|
return false if text.include?(word)
end
true
end

in the test filter, because my solution does not work with swedish
characters. Trying it with wswedish gives errors when using the regex
and no errors when using include?, so I think there must be something
weird going on in the regex.

Hmm, I'm bummed that regex doesn't work. I was trying to use something
that would be worldly. Maybe one of the regex gurus will jump in here
and correct it for us.

My complaint about String#include? is that banning the word "box" and
then passing in a message containing the word "boxer" will break it,
right?

James Edward Gray II
 
H

Hal Fulton

James said:
Hmm, I'm bummed that regex doesn't work. I was trying to use something
that would be worldly. Maybe one of the regex gurus will jump in here
and correct it for us.

My complaint about String#include? is that banning the word "box" and
then passing in a message containing the word "boxer" will break it, right?

Maybe something like text.split.include?(word)


Hal
 
H

Hal Fulton

James said:
Yeah, that's probably better, if a little less efficient (I assume).

Probably is less efficient.

Also it's not quite right, as it doesn't ignore punctuation.


Hal
 
J

Jannis Harder

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


My results (I use a list of english words NOT random "words"):


10 Tests (different words banned) with:
121430 Words
3 Banned
average: 81.7 Mails

10 Tests (different words banned) with:
121430 Words
126 Banned
average: 2110.5 Mails

30 Tests (different words banned) with:
3930 Words
29 Banned
average: 356.9 Mails

10 Tests (different words banned) with:
125282 Words(the complete list)
80 Banned
average: 1426.5 Mails



PS: My code is VERY slow (I'm still trying to optimize it).

- --
Jannis Harder (Germany)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (Darwin)

iD8DBQFBp6kj5YRWfc27RzQRAuixAJ91TKY36U1noiooNd1Jix9yt2VuoACeMwiJ
c5BeEQNQR4kigtCzSOP9lRs=
=2MLo
-----END PGP SIGNATURE-----
 
M

Markus Koenig

Hal said:
Probably is less efficient.

Also it's not quite right, as it doesn't ignore punctuation.

What about
ALL_WORDS = /\b[^\s.,:;_!?\/&()"-]+\b/
text.scan(ALL_WORDS).map{|x| x.downcase}.include?(word)
Even less efficient, but I guess it does always work.

Markus
 
B

Brian Schröder

My results (I use a list of english words NOT random "words"):

[snip]

Hello Jannis,

if you are using something like /usr/share/dict/wenglish would you mind sharing your test-code, such that we could compare the algorithms on a fixed basis?

Regards,

Brian
 
B

Brian Schröder

Hmm, I'm bummed that regex doesn't work. I was trying to use something
that would be worldly. Maybe one of the regex gurus will jump in here
and correct it for us.

My complaint about String#include? is that banning the word "box" and
then passing in a message containing the word "boxer" will break it,
right?

James Edward Gray II

Ah, I thought this behaviour was wanted. If this is not wanted my whole optimization was in vain. :(

Regards,

Brian
 
J

James Edward Gray II

Ah, I thought this behaviour was wanted. If this is not wanted my
whole optimization was in vain. :(

Hmm... "sin" and "single" have pretty different meanings. It would
seem unfortunate to group them together.

James Edward Gray II
 
F

Fredrik Jagenheim

Which filter class is the latest? Or should we use the standard
filter, and cope with the fact that it will give false positivies?

I noticed that the current filter gives several false positives, not
only for 'special' characters, but also for apostrophes; say that the
forbidden word is Clara, I will get a positive on both Clara and
Clara's, making it impossible to use verify to check my list...

I decided to 'cheat' by cleaning up my dictionary file from all the
'forbidden' characters and now I average 280ish with a dictionary
containing 127367 words and 10 forbidden words.

//F

Btw, as a sidenote I can say that the spamfilter we have at work have
the exact same problem. It doesn't know swedish characters, so when it
looks at 'medsända' it finds 'meds' which is a forbidden word...
 
J

James Edward Gray II

Which filter class is the latest? Or should we use the standard
filter, and cope with the fact that it will give false positivies?

Ah the joys of getting a specification right, eh? ;)

Test with whatever you are comfortable with. I believe everything
that's been proposed so for has some disadvantages.

I will use what I put forth in the quiz when writing the summary, with
a dictionary that has no special characters in it, not even
apostrophes. In my opinion, this does not affect the focus of the
quiz, which is to find/invent an algorithm.

Hope that helps.

James Edward Gray II
 
W

Wayne Vucenic

Below is my solution, formatted to use Jannis' testing classes.

The solution basically recursively calls clean? on the full array of
words, then on the first half and on the second half, then on the
first quarter, the second quarter, the third quarter, etc. Whenever
clean? returns true, it stops dividing that section of the array.
(This description is breadth-first, but the algorithm actually runs
depth-first. Breadth-first is easier to explain, and the final result
is the same.)

I then noticed that we don't really need to make all these calls to
clean?, because we can sometimes deduce what the results will be
without calling clean?. For example, if we call clean? on [0...8]
and the result is false, then we call clean? on [0...4] and the result
is true, then we know that clean? on [4...8] must return false, so we
don't need to make this call.

The ordinary recursion is done by the method findBanned(aWords). When
the above deduction tells us that clean? will return false, we instead
call the method findBannedThereIsOne(aWords). In the above example we
would call findBannedThereIsOne(aWords[4...8]), and it will skip
calling clean? on aWords[4...8].

(Yes, findBannedThereIsOne is an awkward name, but at least it's
better than findBannedAndWeKnowThereIsAtLeastOne)

When findBannedThereIsOne is called with an array with only one
element, it doesn't need to call clean? at all, because we know this
must be a banned word.

So that findBanned and findBannedThereIsOne only need to deal with
arrays whose size is a power of two, the main routine,
runYourAlgorithm, divides the input array into chunks of these sizes.
As I'm typing this, it occurs to me that this code might actually run
faster if runYourAlgorithm didn't do this. I'll give this a try, but
for now I'll post the version that does the chunking.

runYourAlgorithm also handles the case of an empty array, so the other
two routines don't need to handle it.

Here are the results with Jannis' testing classes. For the times in
seconds, keep in mind that these tests were run on an ancient 500MHz
Pentium III with other work going on at the same time.

Runs: 10
Words: 121430
Banned Words: 3
Average
Seconds: 0.5561
Mails: 76
Verified: 10/10
Rebuilt Banned Words

Runs: 10
Words: 121430
Banned Words: 126
Average
Seconds: 254.4113
Mails: 1948
Verified: 10/10
Rebuilt Banned Words

Runs: 10
Words: 3930
Banned Words: 29
Average
Seconds: 12.5187
Mails: 327
Verified: 10/10
Rebuilt Banned Words

Runs: 10
Words: 125282
Banned Words: 80
Average
Seconds: 139.9058
Mails: 1319
Verified: 10/10
Rebuilt Banned Words

Runs: 10
Words: 127367
Banned Words: 10
Average
Seconds: 1.5328
Mails: 210
Verified: 10/10
Rebuilt Banned Words


Wayne Vucenic
No Bugs Software
"Ruby and C++ Contract Programming in Silicon Valley"


======================================

class YourAlgorithm < RQ9Algorithm
def run()
runYourAlgorithm(@words)
end

# Returns an array containing all banned words from aWords
def runYourAlgorithm(aWords)
return [] if aWords.empty?

# Compute the largest power of two <= aWords.size
powerOfTwo = 1
powerOfTwo *= 2 while powerOfTwo * 2 <= aWords.size

# To simplify the logic in findBanned, we always call it with an
# array whose size is a power of two.
aBanned = findBanned(aWords[0...powerOfTwo])

# If we didn't process all of aWords, recursively do the rest
if powerOfTwo < aWords.size
aBanned += runYourAlgorithm(aWords[powerOfTwo..-1])
end

aBanned
end

# Returns an array containing all banned words from aWords
# aWords.size is a power of 2, and is > 0
def findBanned(aWords)
if aWords.size == 1
@filter.clean?(aWords[0]) ? [] : aWords
elsif @filter.clean?(aWords.join(' '))
[]
else
iSplit = aWords.size / 2
if @filter.clean?(aWords[0...iSplit].join(' '))
# There is at least one banned word in 0..-1, but not in 0...iSplit,
# so there must be one in iSplit..-1
findBannedThereIsOne(aWords[iSplit..-1])
else
# From the test above we know there is a banned word in 0...iSplit
findBannedThereIsOne(aWords[0...iSplit]) +
findBanned(aWords[iSplit..-1])
end
end
end

# Returns an array containing all banned words from aWords
# aWords.size is a power of 2, and is > 0
# Our caller has determined there is at least one banned word in aWords
def findBannedThereIsOne(aWords)
if aWords.size == 1
# Since we know there is at least one banned word, and since there is
# only one word in the array, we know this word is banned without
# having to call clean?
aWords
else
iSplit = aWords.size / 2
if @filter.clean?(aWords[0...iSplit].join(' '))
# There is at least one banned word in 0..-1, but not in 0...iSplit,
# so there must be one in iSplit..-1
findBannedThereIsOne(aWords[iSplit..-1])
else
# From the test above we know there is a banned word in 0...iSplit
findBannedThereIsOne(aWords[0...iSplit]) +
findBanned(aWords[iSplit..-1])
end
end
end
end
 
B

Brian Schröder

Hello Grouo

This was a short quiz. It was not possible for me to get a lot better than my first try. (Until I just now read the other solution :( )

Even though it was not too complicated I again spend too much time on it. If I'd live in the states, I'd definitly sue James for stealing my time;)

The main idea is, that given a list of words, I check if it contains a word that is banned. If the whole chunk passes, I can forget about all words in this chunk and return the empty list.

Otherwise I split the list into equally sized chunks and continue with each chunk. If a chunk that contains only one word is tested and matches, I have found a match.

Here is the algorithm implemented for splitting into two slices.

def find_words(filter, words)
return [] if words.empty? or filter.clean?(words.join(' '))
return words if words.length == 1
return find_words(filter, words[0...words.length / 2]) +
find_words(filter, words[words.length / 2..-1])
end

The fastest algorithm of this type is the above for n = 3. Here is the generic version.

def find_words_n(filter, words, n = 2)
return [] if words.empty? or filter.clean?(words.join(' '))
return words if words.length == 1
n = words.length if n > words.length
slices = Array.new(n) { | i | i * words.length / n } << words.length
slices[0..-2].zip(slices[1..-1]).inject([]) do | result, (low, high) |
result + find_words_n(filter, words[low...high], n)
end
end

Then I tried to find the optimal solution on a more formal basis. My musings are can be read here:
http://ruby.brian-schroeder.de/quiz/detect_words/content/content.html
http://ruby.brian-schroeder.de/quiz/detect_words/content.ps

This resulted in a huge lookup table where I have the optimal split factor for a given number of words and expected probabilty.

After reading the solution by Wayne Vucenic I understood that one can save some calls, if we now that we had a match in wordlist w and no match in the first parts of the wordlist. The last part then neccessarily has to match.

This is implemented here

def find_words(filter, words, filter_mask = true)
return [] if words.empty? or (filter_mask and filter.clean?(words.join(' ')))
return words if words.length == 1
result = find_words(filter, words[0...words.length / 2])
result + find_words(filter, words[words.length / 2..-1], !result.empty?)
end

And for n > 2:

def find_words_n(filter, words, n = 2, filter_mask = true)
return [] if words.empty? or (filter_mask and filter.clean?(words.join(' ')))
return words if words.length == 1
n = words.length if n > words.length
slices = Array.new(n) { | i | i * words.length / n } << words.length
slices = slices[0..-2].zip(slices[1..-1])
result = slices[0..-2].inject([]) do | result, (low, high) | result + find_words_n(filter, words[low...high], n) end
result + find_words_n(filter, words[slices[-1][0]...slices[-1][1]], n, !result.empty?)
end

Also I implemented a solution, that saves calls depending on the information that also superwords of banned words are banned. E.g. if sex is banned, sexual is also banned. It turned out that the specs did not include this behaviour, so the solution can be found under the unsuccessfull tries.

The full story (All sources, results, etc) is here:
http://ruby.brian-schroeder.de/quiz/detect_words/

regards,

Brian
 
J

James Edward Gray II

The solution basically recursively calls clean? on the full array of
words, then on the first half and on the second half, then on the
first quarter, the second quarter, the third quarter, etc. Whenever
clean? returns true, it stops dividing that section of the array.

Below is my solution, which works the same way.

This "Divide and Conquer" approach is very effective when only a few
words are banned, but I believe there are better approaches for many
banned words. Hopefully, we'll see one submitted...
I then noticed that we don't really need to make all these calls to
clean?, because we can sometimes deduce what the results will be
without calling clean?. For example, if we call clean? on [0...8]
and the result is false, then we call clean? on [0...4] and the result
is true, then we know that clean? on [4...8] must return false, so we
don't need to make this call.

Clever. I didn't think of this.
Here are the results with Jannis' testing classes.
[...]

Words: 121430
Banned Words: 3
[...]

Words: 121430
Banned Words: 126
[...]

Words: 3930
Banned Words: 29
[...]

Words: 125282
Banned Words: 80
[...]

Words: 127367
Banned Words: 10

I consider all of these tests to be in the "very few are blocked"
category, from the quiz. Anyone try anything closer to the suggested
10% blocked from the quiz?

James Edward Gray II

#!/usr/bin/env ruby

#
# A simple class for managing a filter that prevents to use
# of a given _banned_words_ list.
#
class LanguageFilter
#
# Create a new LanguageFilter object that will
# disallow _banned_words_.
# Accepts a list of words, arrays of words,
# or a combination of the two.
#
def initialize( *banned_words )
@banned_words = banned_words.flatten.sort
@clean_calls = 0
end

# A count of the calls to <i>clean?</i>.
attr_reader :clean_calls

#
# Test if provided _text_ is allowable by this filter.
# Returns *false* if _text_ contains _banned_words_,
# *true* if it does not.
#
def clean?( text )
@clean_calls += 1
@banned_words.each do |word|
return false if text =~ /\b#{word}\b/
end
true
end

#
# Verify a _suspect_words_ list against the actual
# _banned_words_ list.
# Returns *false* if the two lists are not identical or
# *true* if the lists do match.
# Accepts a list of words, arrays of words,
# or a combination of the two.
#
def verify( *suspect_words )
suspect_words.flatten.sort == @banned_words
end
end

# my algorithm
def isolate( list, test )
if test.clean? list.join(" ")
Array.new
elsif list.size == 1
list
else
left, right = list[0...(list.size / 2)], list[(list.size / 2)..-1]
isolate(left, test) + isolate(right, test)
end
end

# test code
words = ARGF.read.split " "
filter = LanguageFilter.new words.select { rand <= 0.01 }

start = Time.now
banned = isolate words, filter
time = Time.now - start

puts "#{words.size} words, #{banned.size} banned words found"
puts "Correct? #{filter.verify banned}"
puts "Time taken: #{time} seconds"
puts "Calls: #{filter.clean_calls}"
puts "Words:"
puts banned.map { |word| "\t" + word }

__END__
 
J

Jannis Harder

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

My solution:

1 Send mails with N words (see @steps)
2 delete all words of the clean mails
3 shuffle word list
4 repeat with a smaller N

###########
class JannisHardersAlgorithm < RQ9Algorithm
def initialize()
@steps = [110000,700000,300000,200000,110000,
70000,30000,20000,11000,7000,
3000,2000,1100,700,300,200,110,
70,30,20,11,7,3,2,1]
end
def run()
wordsLeft = @words.dup

@steps.each do |count|
if count*6 > @words.length
next
end

position = 0
while position < wordsLeft.length
testMail = wordsLeft[position ..
position+count-1].join(" ")
if @filter.clean? testMail
(position .. position+count-1).each do |i|
wordsLeft=nil
end
end
position += count
end
wordsLeft.compact! # delete nil words
wordsLeft = wordsLeft.sort_by{rand} #shuffle
end
wordsLeft
end
end
##########
Results :
Test: (15000,2,5,true) # 0.1%
Runs: 5
Words: 15000
Banned Words: 2
Average
Seconds: 0.1430744
Mails: 47
Verified: 5/5
Rebuilt Banned Words



Test: (15000,15,5,true) # 1%
Runs: 5
Words: 15000
Banned Words: 15
Average
Seconds: 0.833412
Mails: 260
Verified: 5/5
Rebuilt Banned Words


Test: (15000,150,5,true) # 10%
Runs: 5
Words: 15000
Banned Words: 150
Average
Seconds: 9.6565792
Mails: 1786
Verified: 5/5
Rebuilt Banned Words
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (Darwin)

iD8DBQFBqlIu5YRWfc27RzQRAiUmAKCABFWkF/7DHWV1RvmdZYFN1RwiBgCgzebx
zKrKCGj2L0SuRaayuRvGJJg=
=qzPr
-----END PGP SIGNATURE-----
 
B

Brian Schröder

My solution:

1 Send mails with N words (see @steps)
2 delete all words of the clean mails
3 shuffle word list
4 repeat with a smaller N

###########
class JannisHardersAlgorithm < RQ9Algorithm
def initialize()
@steps = [110000,700000,300000,200000,110000,
70000,30000,20000,11000,7000,
3000,2000,1100,700,300,200,110,
70,30,20,11,7,3,2,1]
end
[snip]

Interesting. How did you come up with the step numbers?

Regards,

Brian
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,537
Members
45,022
Latest member
MaybelleMa

Latest Threads

Top