[SUMMARY] Longest Repeated Substring (#153)

R

Ruby Quiz

I first saw this problem years ago. It was one of the Perl's Quiz of the Week
problems. I remember being surprised at how the solution was sort of
counter-intuitive to me. I mean that my first thought was: start at half the
string size, try to find a string of that length that occurs twice in the full
text, reduce the length by one character and recheck until you find a match.
Yoan Blanc coded that strategy up for us:

text = STDIN.read

(text.length/2).downto 1 do |l|
match = Regexp.new("(.{#{l}})\\1").match(text)
if match
puts text[match.offset(1)[0]..(match.offset(1)[1]-1)]
break
end
end

Technically the regular expression engine will choke on this solution for a
significantly large text. The reason is that quantifiers in {…} limits are
only allowed to be so large. The theory is as I described though.

However, even if it could match any size, this solution turns out to be far too
slow to be practical. The reason is that given a megabyte or more of text, the
longest repeated substring is typically less than 200 bytes. That means you do
a whole lot of wasted checks before you are even in the range that a match can
occur.

Yoan realized this and submitted a second solution that flipped the search
around. Starting with a match of zero characters and finding the next biggest
puts you in the right neighborhood from the get go.

The other key realized in Yoan's second solution is that if we know there is a
four character repeated substring at some index, we can also count on the fact
that there are one, two, and three character repeated substrings at the same
index. Given that, you can keep checking at the same index until you don't find
a match for a given size. The size found just before that was the biggest at
that index.

The Perl guys found further means to optimize this approach, by trimming the
search space. However, this too breaks down in the face of significantly large
inputs. That's why we have suffix trees.

The idea of a suffix tree is to build a list of every suffix that occurs in the
text. Thus the quiz example "banana" breaks down to:

banana
anana
nana
ana
na
a

That may not look like a lot of help yet, but watch what happens if we sort the
entries:

a
ana
anana
banana
na
nana

See how the common prefixes group together? We can now compare adjacent entries
of our suffix tree for prefixes they have in common. In this case, the second
and third element share an "an" and that's one of the possible answers. Note
that the second and third share "ana" as well, but selecting that one causes
overlap:

[ana]na
[ana]

There's a catch though. Consider this example input from Eric I.: ananana.
The suffix tree is:

ananana
nanana
anana
nana
ana
na
a

That sorts into:

a
ana
anana
ananana
na
nana
nanana

In this case comparing just the second and third entries gives us "an", because
"ana" would overlap. But, if you compare the second and fourth entries, "ana"
becomes a valid answer. That tells us that it's sometimes necessary to look
ahead more than just one step.

Enough explaining, let's read some code. I'm going to show Eric's solution
below, because it was very well commented and that helped me understand it.
However, I'm going to remove those comment in the following listings in the
interests of space. I also made a couple of tiny edits that I felt simplified
the code a touch.

First we have a simple helper method:

def longest_common_prefix(s1, s2, max = nil)
min = [s1.size, s2.size, max].compact.min
min.times do |i|
return s1.slice(0, i) if s1 != s2
end
return s1.slice(0, min)
end

# ...

Given two Strings, this code will find the longest prefix they share. It begins
by finding the smallest length among the passed Strings and an optional provided
maximum. It then walks those indexes looking for mismatched characters.

When it finds a mismatch, it returns what came before. How it does that is a
little clever though, since it seems to use the same numbers. Consider that we
are comparing "abc" and "abd" on the third character. That means i = 2 and
s1 != s2. That will cause the code to return s1.slice(0, i), but remember
that i is used as a length there, not an index. That's why we get the first two
characters back ("ab").

If no mismatches arise during the search, they match all the way to our limit
and that is returned.

The next method is where all the action is and it's a doozy, so we will take it
in pieces. The first chunk builds and sorts the suffix tree:

# ...

def longest_repeated_substring(string)
size = string.length

suffixes = Array.new(size)
size.times do |i|
suffixes = string.slice(i, size)
end

suffixes.sort!

# ...

This code might seem wildly inefficient (memory-wise) at first glance, but Ruby
will surprise you here. When you slice() a substring like this, Ruby cheats and
doesn't actually make a copy of the text. Instead, the new object is a pointer
into the old object. Now, if you change either String down the road, Ruby must
and will finish the job, turning it into a full copy. We call this behavior
"Copy on Write." Since we won't be changing these Strings though, just
examining them, we're safe with the tiny pointers for the duration.

Note that the size variable is always used as the substring length here, though
it's too long in all but the first case. Ruby will stop at the end of the
String even if you provide a longer length, so it does the same thing and avoids
unneeded math or Range object construction.

OK, let's begin the search through the tree:

# ...

best = ""
at_least_size = 1 # the size to meet or exceed to be the new best
distance = nil
neighbors_to_check = 1

(1...size).each do |i|
s1 = suffixes

neighbors_to_check.downto(1) do |neighbor|
s2 = suffixes[i - neighbor]

distance = (s1.size - s2.size).abs
if distance < at_least_size
if s1.size >= at_least_size &&
s2.size >= at_least_size &&
s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
neighbors_to_check = [neighbors_to_check, neighbor + 1].max
else
neighbors_to_check = neighbor
end
next
end

# ...

First, some variables are set to hold the best match so far, the size a match
would need to be to beat it, the distance between the substrings, and how far
down the tree we need to search. After, that we enter a simple loop trying each
suffix.

The neighbors_to_check iteration is our look ahead for similar suffixes that
share our prefix. We usually only need to look one step ahead, but these checks
watch for overlap cases and extends our search when needed.

Note that we figure a distance between substrings here, even though we didn't
remember where they started. That's possible because all suffixes are anchored
to the far edge of our original text. Knowing that, comparing lengths is the
same as comparing starting indexes.

A distance below our needed match size warns us that there is overlap and we may
need to look further ahead in this case. The if conditions just ensure the
Strings are of the length we should even consider and that they match at least
that much. If all of that turns out to be true, our search ahead is extended.

Now we are ready to do the actual comparisons:

# ...

unless s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
neighbors_to_check = neighbor
next
end

best = longest_common_prefix(s1, s2, distance)
at_least_size = best.size + 1
if best.size == distance
neighbors_to_check = [neighbors_to_check, neighbor + 1].max
else
neighbors_to_check = neighbor
end
end
end

best
end

# ...

The unless check we see here is just a short circuit optimization. There's no
need to check for common prefixes if the Strings aren't a match at least as long
as our current best.

If we've made it this far, we know the current two Strings share a prefix that
meets or exceeds our current best. A hand-off is made to
longest_common_prefix() to grab our new best and the rest of the iterator resets
the search parameters.

The best we found in the entire search is then returned as the solution.

Here's the interface code that wraps this method:

# ...

if $0 == __FILE__
string = nil
if ARGV[0] == "-f"
string = File.read(ARGV[1])
elsif ARGV.size == 0
string = STDIN.read
elsif ARGV[0] =~ /^-/ || ARGV.size > 1
STDERR.puts "usage:"
STDERR.puts " #{$0} (note: input comes from standard input)"
STDERR.puts " #{$0} string"
STDERR.puts " #{$0} -f filename"
exit
else
string = ARGV[0]
end

result = longest_repeated_substring(string)
puts %Q{"#{result}" (#{result.length} characters)}
end

Most of the above code just sorts out the command-line arguments. The checks
determine whether to read the text from a file, STDIN, or the arguments
themselves.

The last two lines call the workhorse method we just finished examining and
print details about the best match found.

My thanks to all who stretched my brain with all of the excellent discussion
about this week's problem.

Tomorrow we will try Ruby's hand at coin counting...
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,534
Members
45,008
Latest member
Rahul737

Latest Threads

Top