[SUMMARY] Longest Repeated Substring (#153)

Discussion in 'Ruby' started by Ruby Quiz, Jan 24, 2008.

  1. Ruby Quiz

    Ruby Quiz Guest

    I first saw this problem years ago. It was one of the Perl's Quiz of the Week
    problems. I remember being surprised at how the solution was sort of
    counter-intuitive to me. I mean that my first thought was: start at half the
    string size, try to find a string of that length that occurs twice in the full
    text, reduce the length by one character and recheck until you find a match.
    Yoan Blanc coded that strategy up for us:

    text = STDIN.read

    (text.length/2).downto 1 do |l|
    match = Regexp.new("(.{#{l}})\\1").match(text)
    if match
    puts text[match.offset(1)[0]..(match.offset(1)[1]-1)]
    break
    end
    end

    Technically the regular expression engine will choke on this solution for a
    significantly large text. The reason is that quantifiers in {…} limits are
    only allowed to be so large. The theory is as I described though.

    However, even if it could match any size, this solution turns out to be far too
    slow to be practical. The reason is that given a megabyte or more of text, the
    longest repeated substring is typically less than 200 bytes. That means you do
    a whole lot of wasted checks before you are even in the range that a match can
    occur.

    Yoan realized this and submitted a second solution that flipped the search
    around. Starting with a match of zero characters and finding the next biggest
    puts you in the right neighborhood from the get go.

    The other key realized in Yoan's second solution is that if we know there is a
    four character repeated substring at some index, we can also count on the fact
    that there are one, two, and three character repeated substrings at the same
    index. Given that, you can keep checking at the same index until you don't find
    a match for a given size. The size found just before that was the biggest at
    that index.

    The Perl guys found further means to optimize this approach, by trimming the
    search space. However, this too breaks down in the face of significantly large
    inputs. That's why we have suffix trees.

    The idea of a suffix tree is to build a list of every suffix that occurs in the
    text. Thus the quiz example "banana" breaks down to:

    banana
    anana
    nana
    ana
    na
    a

    That may not look like a lot of help yet, but watch what happens if we sort the
    entries:

    a
    ana
    anana
    banana
    na
    nana

    See how the common prefixes group together? We can now compare adjacent entries
    of our suffix tree for prefixes they have in common. In this case, the second
    and third element share an "an" and that's one of the possible answers. Note
    that the second and third share "ana" as well, but selecting that one causes
    overlap:

    [ana]na
    [ana]

    There's a catch though. Consider this example input from Eric I.: ananana.
    The suffix tree is:

    ananana
    nanana
    anana
    nana
    ana
    na
    a

    That sorts into:

    a
    ana
    anana
    ananana
    na
    nana
    nanana

    In this case comparing just the second and third entries gives us "an", because
    "ana" would overlap. But, if you compare the second and fourth entries, "ana"
    becomes a valid answer. That tells us that it's sometimes necessary to look
    ahead more than just one step.

    Enough explaining, let's read some code. I'm going to show Eric's solution
    below, because it was very well commented and that helped me understand it.
    However, I'm going to remove those comment in the following listings in the
    interests of space. I also made a couple of tiny edits that I felt simplified
    the code a touch.

    First we have a simple helper method:

    def longest_common_prefix(s1, s2, max = nil)
    min = [s1.size, s2.size, max].compact.min
    min.times do |i|
    return s1.slice(0, i) if s1 != s2
    end
    return s1.slice(0, min)
    end

    # ...

    Given two Strings, this code will find the longest prefix they share. It begins
    by finding the smallest length among the passed Strings and an optional provided
    maximum. It then walks those indexes looking for mismatched characters.

    When it finds a mismatch, it returns what came before. How it does that is a
    little clever though, since it seems to use the same numbers. Consider that we
    are comparing "abc" and "abd" on the third character. That means i = 2 and
    s1 != s2. That will cause the code to return s1.slice(0, i), but remember
    that i is used as a length there, not an index. That's why we get the first two
    characters back ("ab").

    If no mismatches arise during the search, they match all the way to our limit
    and that is returned.

    The next method is where all the action is and it's a doozy, so we will take it
    in pieces. The first chunk builds and sorts the suffix tree:

    # ...

    def longest_repeated_substring(string)
    size = string.length

    suffixes = Array.new(size)
    size.times do |i|
    suffixes = string.slice(i, size)
    end

    suffixes.sort!

    # ...

    This code might seem wildly inefficient (memory-wise) at first glance, but Ruby
    will surprise you here. When you slice() a substring like this, Ruby cheats and
    doesn't actually make a copy of the text. Instead, the new object is a pointer
    into the old object. Now, if you change either String down the road, Ruby must
    and will finish the job, turning it into a full copy. We call this behavior
    "Copy on Write." Since we won't be changing these Strings though, just
    examining them, we're safe with the tiny pointers for the duration.

    Note that the size variable is always used as the substring length here, though
    it's too long in all but the first case. Ruby will stop at the end of the
    String even if you provide a longer length, so it does the same thing and avoids
    unneeded math or Range object construction.

    OK, let's begin the search through the tree:

    # ...

    best = ""
    at_least_size = 1 # the size to meet or exceed to be the new best
    distance = nil
    neighbors_to_check = 1

    (1...size).each do |i|
    s1 = suffixes

    neighbors_to_check.downto(1) do |neighbor|
    s2 = suffixes[i - neighbor]

    distance = (s1.size - s2.size).abs
    if distance < at_least_size
    if s1.size >= at_least_size &&
    s2.size >= at_least_size &&
    s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
    neighbors_to_check = [neighbors_to_check, neighbor + 1].max
    else
    neighbors_to_check = neighbor
    end
    next
    end

    # ...

    First, some variables are set to hold the best match so far, the size a match
    would need to be to beat it, the distance between the substrings, and how far
    down the tree we need to search. After, that we enter a simple loop trying each
    suffix.

    The neighbors_to_check iteration is our look ahead for similar suffixes that
    share our prefix. We usually only need to look one step ahead, but these checks
    watch for overlap cases and extends our search when needed.

    Note that we figure a distance between substrings here, even though we didn't
    remember where they started. That's possible because all suffixes are anchored
    to the far edge of our original text. Knowing that, comparing lengths is the
    same as comparing starting indexes.

    A distance below our needed match size warns us that there is overlap and we may
    need to look further ahead in this case. The if conditions just ensure the
    Strings are of the length we should even consider and that they match at least
    that much. If all of that turns out to be true, our search ahead is extended.

    Now we are ready to do the actual comparisons:

    # ...

    unless s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
    neighbors_to_check = neighbor
    next
    end

    best = longest_common_prefix(s1, s2, distance)
    at_least_size = best.size + 1
    if best.size == distance
    neighbors_to_check = [neighbors_to_check, neighbor + 1].max
    else
    neighbors_to_check = neighbor
    end
    end
    end

    best
    end

    # ...

    The unless check we see here is just a short circuit optimization. There's no
    need to check for common prefixes if the Strings aren't a match at least as long
    as our current best.

    If we've made it this far, we know the current two Strings share a prefix that
    meets or exceeds our current best. A hand-off is made to
    longest_common_prefix() to grab our new best and the rest of the iterator resets
    the search parameters.

    The best we found in the entire search is then returned as the solution.

    Here's the interface code that wraps this method:

    # ...

    if $0 == __FILE__
    string = nil
    if ARGV[0] == "-f"
    string = File.read(ARGV[1])
    elsif ARGV.size == 0
    string = STDIN.read
    elsif ARGV[0] =~ /^-/ || ARGV.size > 1
    STDERR.puts "usage:"
    STDERR.puts " #{$0} (note: input comes from standard input)"
    STDERR.puts " #{$0} string"
    STDERR.puts " #{$0} -f filename"
    exit
    else
    string = ARGV[0]
    end

    result = longest_repeated_substring(string)
    puts %Q{"#{result}" (#{result.length} characters)}
    end

    Most of the above code just sorts out the command-line arguments. The checks
    determine whether to read the text from a file, STDIN, or the arguments
    themselves.

    The last two lines call the workhorse method we just finished examining and
    print details about the best match found.

    My thanks to all who stretched my brain with all of the excellent discussion
    about this week's problem.

    Tomorrow we will try Ruby's hand at coin counting...
    Ruby Quiz, Jan 24, 2008
    #1
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Replies:
    2
    Views:
    361
  2. Joerg Schuster

    leftmost longest match (of disjunctions)

    Joerg Schuster, Dec 1, 2003, in forum: Python
    Replies:
    12
    Views:
    697
    Joerg Schuster
    Dec 3, 2003
  3. Ruby Quiz
    Replies:
    52
    Views:
    647
  4. Replies:
    3
    Views:
    195
    Sherm Pendley
    Aug 3, 2005
  5. Henry Townsend

    longest common substring

    Henry Townsend, Nov 1, 2006, in forum: Perl Misc
    Replies:
    4
    Views:
    196
    Peter Scott
    Nov 2, 2006
Loading...

Share This Page