[QUIZ] Encyclopedia Construction (#205)

D

Daniel Moore

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this
quiz until 48 hours have elapsed from the time this message was
sent.

2. Support Ruby Quiz by submitting ideas and responses
as often as you can!
Visit: <http://rubyquiz.strd6.com/suggestions>

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem
helps everyone on Ruby Talk follow the discussion. Please reply to
the original quiz message, if you can.

RSS Feed: http://rubyquiz.strd6.com/quizzes.rss

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

## Encyclopedia Construction (#205)

This week's quiz comes from Tim Hunter. Thank you Tim for the great write up!

When I was a kid we had an encyclopedia. It had thousands of articles
in alphabetical order, divided into about 25 volumes. Sometimes there
were so many articles with the same initial letter ("M", for example)
that an entire volume was dedicated to articles with that initial
letter. In some cases, though, there weren't enough articles with the
same initial letter to devote an entire volume to them, so those
articles were grouped in with alphabetically adjacent articles in the
same volume. For example, all the articles starting with W, X, Y, and
Z were in the same volume. Of course, all the articles that started
with the same initial letter were always in the same volume. Each
volume was labeled with the initial character of the articles it
contained. For example, the "M" volume was simply labeled "M". If
the articles were spread over more than one initial character, the
label displayed the first and last characters in the range. The W, X,
Y, and Z volume was labeled "W-Z".

The quiz is to devise a way of grouping a list of encyclopedia articles into
volumes. Since this is Ruby Quiz, we'll say that for "articles" we
have a list of about 100 words composed of alphabetical characters,
a-z. A "volume" is an array.

Sort the list and then put the words into arrays using the following rules.

1. All the words with the same initial character must be in the same array.
2. If an array contains words with different initial characters, the
words must be alphabetically adjacent to each other. That is, if
"cat", "dog", and "elephant", are in the list of articles, and you
put "cat" and "elephant" in the same array, then "dog" must also be in
that array.
3. Without breaking rules 1 and 2, divide the words into about 10 arrays.
4. Without breaking rules 1 and 2, each of the arrays should contain
about the same number of words.

To display the encyclopedia, make a hash with one element for each
array. The element value is the array. The element key is a single
letter ("M") when the array only contains words with the same initial
character, and a range ("W-Z") when the array contains words with more
than one initial character. Print the hash with the "pp" command.

Here's the list of words:

ape
apple
apricot
aquamarine
architect
artichoke
azure
baboon
badger
banana
bat
battleship
beeswax
beetles
black
blogger
blue
bricklayer
Brigadoon
card
cat
cherry
chokeberry
coconut
coral
crimson
currant
dam
debate
deliberations
demagogue
dog
durian
elephant
empathy
encyclopedia
fig
fox
gazelle
giraffe
goldenrod
gray
hippopotamus
huckleberry
ibex
imprint
indigo
ivory
jab
jackrabbit
jake
khaki
kiwi
kudzu
lavender
lemming
leopard
lime
loganberry
mango
maroon
memory
mole
monkey
mouse
muskox
Nigeria
Navajo
olive
opera
panther
party
peach
pear
penguin
persimmon
plum
poet
politician
privy
programmer
rabbit
red
research
rhino
Rumpelstiltskin
salmon
silver
snow
soldier
songsmith
squirrel
starship
steel
strawberry
student
tan
tapir
theater
thesaurus
tree
turquoise
umbrella
warthog
waterloo
wildebeest
wolf
woodchuck
xylophone
yahoo
yellow
zebra
 
F

Frank Fischer

Hi,

here's my solution. It's based on classic dynamic programming over the
number of words for each letter.

I used different objective functions to get good book sizes:

1. Maximize the minimal number of words in one book,
2. minimize the maximal number of words in one book,
3. minimize the absolute deviation of the number of words in one
book to the average number of words in one book ,
4. minimize the quadratic deviation of the number of words in one
book to the average number of words in one book.

For the given word list, objective function 4 seems to yield the best
results with good distributed word counts.

The program is called with three arguments:

1. The name of the word-list file (one word per line),
2. the number of books in which the words should be distributed,
3. a number 1,2,3 or 4 selecting the objective function.

The programm prints

1. the number of words per letter,
2. the words per book,
3. the number of words per book.

The current implementation is quite inelegant and could be improved
and be more rubyish.


---
require 'facets/enumerable'
require 'enumerator'
require 'pp'

word_file = ARGV.shift || "words.txt"
nbooks = ARGV.shift.to_i || 10
algorithm = ARGV.shift.to_i || 3

# ruby 1.8
class Integer
def ord; self; end
end


class Wordlist

attr_reader :books

def initialize( words )
# sort them
@words = words.sort {|w1, w2| w1.downcase <=> w2.downcase }

# get number of words per letter
@nletters = Array.new(26, 0)
@words.each do |w|
@nletters[w.downcase[0].ord - ?a.ord] += 1
end

end


# Returns the number of words per character
def letter_counts
(?A .. ?Z).mash{|c| [c.chr, @nletters[c.ord - ?A.ord]]}
end


# Returns number of words in each book
def counts
@books.mash{|range, words| [range, words.size]}
end


# divide into +nbooks+ books by minimizing the maximal number
# of words in one book
def min_maximum( nbooks )
dyn_min( nbooks ) { |*s| s.max }
end


# divide into +nbooks+ books by maximizing the minmal number
# of words in one book
def max_minimum( nbooks )
dyn_min( nbooks ) { |*s| -s.min }
end


# divide into +nbooks+ books by minimizing the deviation from the
# average number of words in one book
def min_deviat( nbooks )
mean = sum( (e-mail address removed) ).to_f / nbooks
dyn_min( nbooks ) { |*s| s.inject(0){|sum,x| sum + (x - mean).abs} }
end


# divide into +nbooks+ books by minimizing the quadratic deviation
# from the average number of words in one book
def min_deviat2( nbooks )
mean = sum( (e-mail address removed) ).to_f / nbooks
dyn_min( nbooks ) { |*s| s.inject(0){|sum,x| sum + (x - mean) ** 2 } }
end

private

# computes
# sum_{i\in range} @nletters
def sum( range )
range.inject(0) {|s,i| @nletters + s}
end


# A range of letters in the same book.
class Book
def initialize( range, n_words )
@range = range
@n_words = n_words
end

# The first letter in this book
def min; @range.min; end

# The last letter in this book
def max; @range.max; end

# Is letter x in this book?
def include?(x); @range.include?(x); end

# returns the number of all words in this book
attr_reader :n_words
end


# Computes a solution where
# max{ func(part_sum) }
# is minimized
def dyn_min( nbooks, &func )
mean = sum( (e-mail address removed) ).to_f / nbooks

books = Array.new( @nletters.size )
s = 0
(@nletters.size - 1).downto(0) do |i|
s += @nletters
range = i ... @nletters.size
books = [Book.new( range, sum(range) )]
end

nbooks.times do

new_books = Array.new( @nletters.size )

for s in 0 ... @nletters.size
best_value = nil
best_end = nil
sum = @nletters # value of the current part
for e in (s + 1) ... @nletters.size
# stop if no further subdivisions are possible
break unless books[e]

value = func.call(sum, *books[e].map{|r| r.n_words})
if best_value.nil? || value < best_value
best_value = value
best_end = e
end

sum += @nletters[e] || 0
end

if best_value
new_books = [Book.new(s ... best_end,
sum(s ... best_end))] +
books[best_end]
end
end

books = new_books
end

@books = Hash.new
books[0].each do |book|
words = @words.find_all{|w|
book.include?(w.downcase[0].ord - ?a.ord)
}
if book.min == book.max
key = (?A.ord + book.min).chr
else
key = "#{(?A.ord + book.min).chr}-#{(?A.ord + book.max).chr}"
end
@books[key] = words
end

@books
end
end

# read word list
words = []
File.open( word_file, "r" ) do |f|
f.each_line do |line|
line.strip!
words << line unless line.empty?
end
end



list = Wordlist.new( words )
p list.letter_counts

case algorithm
when 1
list.min_maximum( nbooks )
when 2
list.max_minimum( nbooks )
when 3
list.min_deviat( nbooks )
when 4
list.min_deviat2( nbooks )
else
raise "Unknown algorithm: #{algorithm} (should be in {0,1,2,3})"
end

pp list.books
p list.counts
 
L

Luke Cowell

Here's my solution:
http://www.pastie.org/482422


Usage:
e = Encyclopedia.new(articles)
e.volumes(20) #<-- how many volumes should it be condensed to?
pp e.hash

puts e.dump #<-- this will print a nice little chart to help you
visualize the distribution of articles.


My strategy was to:
-divide each group of words starting with the same letter into it's own
array
-if I had to reduce the number of volumes:
-combine 2 adjacent letter arrays into a single array based on which
2 arrays have the smallest combined length
-repeat this procedure until the desired number of volumes is
achieved

Notes:
-This algorithm can result in poorly balanced distribution of volumes
because it glues together adjacent volumes permanently. A more even
distribution could be achieved if you could break a volume off after it
had already been combined.
-Certain number of volumes will result in really even distribution. 17
with the provided articles worked out well.
-I think it would be interesting to split larger volumes as well. I
think in some encyclopedias you'll find volumes with more common words
split (eg. Ma - Me and Mi -Mz)


My code could be much improved - quick and dirty. Well, maybe not so
much quick, but dirty.

Luke
 
F

Frank Fischer

Here's a second version, this time returning the correct number of books
(the first version returned the requested number of books + 1). And it
should be a little cleaner and faster.

Frank

--
require 'facets/enumerable'
require 'enumerator'
require 'pp'

word_file = ARGV.shift || "words.txt"
nbooks = (ARGV.shift || 10).to_i
algorithm = (ARGV.shift || 4).to_i

# ruby 1.8
if RUBY_VERSION[0,4] == "1.8."
class Integer
def ord; self; end
end
end


# A book containing the words starting with a letter in first .. last
class Book

attr_reader :first, :last, :n_words, :words

# Create a book with n words whose first letter is in range.
def initialize( range, n_words )
@range = range.min.ord .. range.max.ord
@n_words = n_words
end


# Set the words in this book
def words=( words )
@words = words.find_all{|w| include?(w.downcase[0])}
@n_words = @words.size
end


# Returns true if the words starting with the given letter are in
# this book
def include?( letter )
@range.include?( letter.ord - ?a.ord )
end

# Returns the character-range.
def range
if @range.min == @range.max
(?A.ord + @range.min).chr
else
(?A.ord + @range.min).chr + "-" + (?A.ord + @range.max).chr
end
end


def to_s; "<Book: %3s, size: #@n_words>" % range; end

def inspect; to_s; end

end


class Wordlist

attr_reader :books

def initialize( words )
# sort them
@words = words.sort {|w1, w2| w1.downcase <=> w2.downcase }

# get number of words per letter
@nletters = Array.new( 26, 0 )
@words.each do |w|
@nletters[w.downcase[0].ord - ?a.ord] += 1
end
end


# Returns the number of words per character
def letter_counts
([email protected]).mash{|i| [(?a.ord + i).chr, @nletters]}
end


# divide into +nbooks+ books by minimizing the maximal number
# of words in one book
def min_maximum( nbooks )
dyn_min( nbooks ) { |s, rest| [s, rest || 0].max }
end


# divide into +nbooks+ books by maximizing the minmal number
# of words in one book
def max_minimum( nbooks )
dyn_min( nbooks ) { |s, rest| [-s, (rest || -s)].max }
end


# divide into +nbooks+ books by minimizing the deviation from the
# average number of words in one book
def min_deviat( nbooks )
mean = sum( 0 ... @nletters.size ).to_f / nbooks
dyn_min( nbooks ) { |s, rest| (rest || 0) + (s - mean).abs }
end


# divide into +nbooks+ books by minimizing the quadratic deviation
# from the average number of words in one book
def min_deviat2( nbooks )
mean = sum( 0 ... @nletters.size ).to_f / nbooks
dyn_min( nbooks ) { |s, rest| (rest || 0) + (s - mean) ** 2 }
end


private

# computes
# sum_{i\in range} @nletters
def sum( range )
range.inject(0) {|s,i| @nletters + s}
end


# Computes a solution minimizing the objective function.
def dyn_min( nbooks, &func )
# create initial books (1 book)
books = Array.new( @nletters.size )
s = 0
(@nletters.size - 1).downto(0) do |i|
s += @nletters
range = i ... @nletters.size
sum = sum(range)
books = [[Book.new( range, sum )], func.call(sum, nil)]
end

(nbooks - 1).times do
books = (0 ... @nletters.size).map { |s|
best_value = nil
best_end = nil
sum = @nletters # value of the current part
for e in s.next ... @nletters.size
if books[e]
value = func.call(sum, books[e][1])
if best_value.nil? || value < best_value
best_value = value
best_end = e
end
end
sum += @nletters[e] || 0
end

if best_value
[[Book.new(s...best_end, sum(s...best_end))] +
books[best_end][0],
best_value]
else
nil
end
}

end

@books = books[0][0]
# collect the words into the books
@books.each do |book|
book.words = @words
end
end
end

# read word list
words = []
File.open( word_file, "r" ) do |f|
f.each_line do |line|
line.strip!
words << line unless line.empty?
end
end



list = Wordlist.new( words )
p list.letter_counts
puts "Number of words: #{list.letter_counts.values.inject(0){|s,x| s + x}}"

case algorithm
when 1
list.min_maximum( nbooks )
when 2
list.max_minimum( nbooks )
when 3
list.min_deviat( nbooks )
when 4
list.min_deviat2( nbooks )
else
raise "Unknown algorithm: #{algorithm} (should be in {1,2,3,4})"
end

pp list.books.mash{|book| [book.range, book.words]}
p list.books.mash{|book| [book.range, book.n_words]}
--
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,902
Latest member
Elena68X5

Latest Threads

Top