[QUIZ] Longest Repeated Substring (#153)

T

tho_mica_l

Hi,

With some delay, here is my solution. I started on Friday with my
take
on making suffix trees respect overlaps but then stopped. I have to
admit that it was Eric's solution that made take another look on
this.
The current version is about 3 times slower than Eric's submission.
It
passes all my test cases though. It also takes the idea from (IIRC)
Dave
Thomas's submission to calculate the original start position on the
basis of the string's length.

BTW it seems to me that ruby19 (cygwin, win xp) uses a __lot__ more
memory for this.

Regards,
Thomas.


#!/usr/bin/env ruby

class String
def longest_substring_st4
suffixes = Array.new(size) {|i| self[i..size]}
suffixes.sort!
common = ''
comlen = 0
suffixes.each_with_index do |curr, index|
next if index == 0
curridx = size - curr.size
pindex = index - 1
pindex.downto(pindex - comlen) do |i|
pred = suffixes
psize = pred.size
predidx = size - psize
maxlen = [(predidx - curridx).abs, psize].min - 1
next if maxlen < comlen
prefix = pred[0 .. comlen]
break if prefix != curr[0..comlen]
(comlen + 1 .. maxlen).each do |i|
p = pred
c = curr
if p == c
prefix << p
else
break
end
end
common = prefix
comlen = common.size
break if comlen <= maxlen
end
end
return common
end
end


if __FILE__ == $0
if ARGV.empty?
puts String.new(STDIN.read).longest_substring_st4
else
ARGV.each {|f| puts
String.new(File.read(f)).longest_substring_st4}
end
end
 
A

Adam Shelly

I had the idea to use suffix trees too, but I literally built a tree.
Each node holds a start index and length for a suffix.
As I add more substrings, I count the number of matching characters
with any previous suffixes, and record the max as needed (after
handling overlap).

Unfortunately, this turns out to be slower that the ones that just
sort all the strings,
and runs out of memory on the Illiad...

-Adam

---

class String
#helper - counts number of matching characters.
def matchlen other
i=0
other.each_byte{|b|
break if b!=self
i+=1
}
i
end
end

class Node
def initialize start,len,tail=nil
@i=start
@l=len
@h=tail ? tail : {}
end

def insert idx,len,matched=0
match = @h[$STR[idx]]
#add or expand child
if match
match.expand(idx,len,matched)
else
@h[$STR[idx]]=Node.new(idx,len)
end
end

def expand idx,len,matched
#count matching characters
matchlen = $STR[@i,@l].matchlen($STR[idx,len])

updateMax(idx-matched, matchlen+matched) if matchlen+matched > $max
if matchlen < @l
#split if partial match
split_at(matchlen, idx,len)
else
#else add remainder of unmatched characters
insert(idx+@l, len-@l, matchlen+matched)
end
end

def split_at point, idx,len
#one child contains the remainder of the original string(s)
newchild = Node.new(@i+point,@l-point, @h)
@h={}
@h[$STR[@i+point]]=newchild
#the other child has the remainder of the new str
@h[$STR[idx+point]]=Node.new(idx+point, len-point)
@l=point #shorten our length
end

def updateMax idx,matchlen
#if our string ends past the beginining of the match,
# discount the overlap
overlap = @i+@l-idx-1
matchlen-=overlap if overlap > 0
if matchlen>$max
$max = matchlen
$maxpt = idx
end
end
end


$STR=ARGF.read
$max=0
$maxpt=0
slen = $STR.length
half= (slen/2.0).ceil
@root=Node.new(0,0)

#we don't have to look at any substrings longer than half the input
(0..half).each{|i|
@root.insert(i,half)
}
#and the ones that start in the second half of the input are shorter still
(half+1..slen).each{|i|
len = slen-i
break if len<$max
@root.insert(i,len)
}

puts $STR[$maxpt,$max].inspect
puts "\n#{$max} chars"
 
T

Todd Benson

In my copy of the Illiad (there's a phrase I never thought I'd use!),
those two sections differ in whitespace / linebreaks:

On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but not
so Agamemnon, who spoke fiercely to him and sent him roughly away.
"Old man," said he, "let me not find you tarrying about our ships, nor
yet coming hereafter.

and

"On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so
Agamemnon, who spoke fiercely to him and sent him roughly away. So
he went back in anger, and Apollo, who loved him dearly, heard his
prayer.

Did you get yours from http://www.fordham.edu/halsall/ancient/homer-illiad.txt

http://www.gutenberg.org/dirs/etext00/iliad10.txt

It goes to show that any text on the "web" is spurious :)

Side note and totally off topic... why do people like to use two el's
in "The Iliad"?

Todd
 
A

Alex Shulgin

My hack at the substring problem (nice one, James) is based on suffix
trees.

Suffix trees and suffix arrays and a great tool for substring
operations. The idea is to create a list of all the possible suffixes
in the original string. For "banana" this would be

banana
anana
nana
ana
na
a

Because the list contains an entry starting at every position in the
string, we know that all possible substrings of the original string
must occur at the start of one of these list elements. The substring
"nan", for example, occurs at the start of the 3rd entry.

Now sort them

a
ana
anana
banana
na
nana

Now we know that all substrings that start with the same sequence of
characters must occur at the start of items in the list that are
adjacent. Searching through to list to find these is a linear operation.

Hi Dave,

Thanks for the brilliant idea--this time I really feel like I've
learned something new! :)

I even took my time to implement it in C (80 lines) and I must admit
that ruby performs very well in this case: my C version is only about
6 times faster than your ruby...

However, with the algorithm presented, I don't see how one can obtain
the correct result for the "aaaa" case. Lets see; the suffix list
would be:

aaaa
aaa
aa
a

now, sorted:

a
aa
aaa
aaaa

And the catch is--in order to obtain the correct result, that is "aa",
one needs to observe "aa" and "aaaa", but these are not adjacent. In
this case two pairs of adjacent suffixes ("aa" and "aaa", "aaa" and
"aaaa") might give "aa", but they do overlap.

So do we need a special-case treatment here?
 
J

Jan

[Note: parts of this message were removed to make it a legal post.]

Hi Alex,

Have the same problem with you :)

I think we should understand the 'adjacent' as 'not overlapped adjacent'
here. Each time we take two adjacent suffix, we check whether they
overlapped first - if so, make s2 point to next suffix, until we found a not
overlapped one.

For aaaa: on the step of comparing aa and aaa, we found they overlapped, so
we keep s1 = aa, but make s2 = aaaa, the one next to aaa.

Thanks
Jan
 
U

Urs Meyer

Hi there,

thought first of two loops, one that moves the start position
of the candidate substring, and another that makes it longer.
It turned out that one loop that combines both is sufficient.

Without considering searching (text.index) the loop is O(n).
However, the index method is O(n). Thus, worst case (for my
program) is O(n^2).

Can one do it in O(n*log(n))?

--

#!/usr/bin/ruby
#
# ruby quiz 153 - longest repeated substring
#
# print the first one found if several such substrings exist
#
# Urs Meyer, 2008-01-22

--

# read text from STDIN, convert to lower case, as you like
text = STDIN.read.tr("\nA-Z"," a-z")

# start position, determines the remaining search space
start = 0
longest_substring = ""

# search while remaining search space is at least twice the
# the size of the currently longest substring

while (2 * longest_substring.size) < (text.length - start)

# generate substring to search for with size is one bigger
# than longest found so far
substring = text[start...(start+longest_substring.size+1)]

# search for it
i = text.index(substring, start+substring.size)

if i.nil?
# nothing found, advance start position
start += 1
else
# found a longer one, record it
longest_substring = substring
end
end

puts longest_substring

--

$ echo banana | ruby lrs.rb
an
$ echo aa | ruby lrs.rb
a
$ echo aaa | ruby lrs.rb
a
$ echo aaaa | ruby lrs.rb
aa
$

some timings...

using random characters:
$ ruby -e "puts Array.new(100000){('a'[0]+rand(26)).chr}.join" | ( time ruby lrs.rb > /dev/null )
21.015u 0.046s 0:21.39 98.4% 0+0k 0+0io 1636pf+0w

with just a bunch of 'a's the job is easier:
$ ruby -e "puts 'a'*100000" | ( time ruby lrs.rb > /dev/null )
3.390u 0.015s 0:03.50 97.1% 0+0k 0+0io 3705pf+0w

lets see whether the result is correct:
$ ruby -e "puts 'a'*100000" | ruby lrs.rb | wc -c
50001

oops, but
$ echo "a" | wc -c
2

o.k. we are fine

if you are curious, I ran it a few times on 100000 random chars:
lnxmfqu
nvpban
ibhwilj
eggfur
mbljqx
gzvxhw
...

~~
Regards
Urs
 
D

Denis Hennessy

Side note and totally off topic... why do people like to use two el's
in "The Iliad"?

Long version: because the capital-I looks identical to the lower case
L so people think they've seen two 'L's. Of course they also heard an
'I' so they insert that first.

Short version: Iggnorance.

/dh
 
E

Eric I.

Dave's code can be corrected by realizing that since all suffix
strings end at the same place (the end of the string), then of the two
adjacent strings being tested, one is a suffix of the other.

This means that to detect overlap, the following test can be used:

  if prefix.length + s1.length > s2.length then
    # Overlap
  end

where "prefix" is the current prefix being checked in the two adjacent
suffix strings.
...

Adding this test (instead of the s2.length / 2 test) and also testing
adjacent strings that start with the prefix currently being searched
(to find later matches if earlier ones overlap) would correct Dave's
solution and shouldn't be much more complicated.

That's very close to what I do, but I go in the opposite direction. I
(try to) track the size window of how many previous sorted suffixes
must be checked against the last in the sequence.

So if I'm comparing substrings a and b in [..., a, b, ...], and if the
amount of non-overlap is smaller than the longest found substring so
far, then I enlarge the window. So now when I'm focusing on c in
[..., a, b, c, ...] I'll compare it to both a and b.

There needs to be a way to shrink the window as well, so it doesn't
grow monotonically, and so if the a and c don't have the same prefix
which exceeds the length of the longest found substring so far, the
window shrinks.

Part of my complexity comes from trying to detect cases where I can
avoid the full comparison early, and so every time I try to do the
avoid (by using "next" to start the next iteration), I must also do
some window management.

Maybe re-orienting the code in your way would simplify things. If you
have the time, I'd be very curious to see what you would come up with.

Eric
 
T

Todd Benson

Long version: because the capital-I looks identical to the lower case
L so people think they've seen two 'L's. Of course they also heard an
'I' so they insert that first.

Short version: Iggnorance.

Okay, this is an ignorant reply. I'm really sorry to the people that
actually want to program. Just delete this :)

Your previous link has two "L's". And many of the other posts do as
well. And, I am pretty sure it has nothing to do with fonts.

Todd
 
T

Todd Benson

Okay, this is an ignorant reply. I'm really sorry to the people that
actually want to program. Just delete this :)

I meant "this will be an ignorant reply". It is hard to throw candor
correctly over the internet.

Has anyone figured out why my text result differs from others?
Spacing/newlines, obviously, is a factor, but I get the same result on
every machine. But then, if you think about it, then to use something
like this, you _really_ have to trust the data.

Todd
 
T

tho_mica_l

Your previous link has two "L's". And many of the other posts do as
well. And, I am pretty sure it has nothing to do with fonts.

I think, it's "Ilias" in ancient Greek anyway. It's quite a common
mistake/transliteration.
 
A

Alex Shulgin

[Note: parts of this message were removed to make it a legal post.]

Hi Alex,

Have the same problem with you :)

I think we should understand the 'adjacent' as 'not overlapped adjacent'
here. Each time we take two adjacent suffix, we check whether they
overlapped first - if so, make s2 point to next suffix, until we found a not
overlapped one.

Hm... but every pair of suffixes _are_ overlapped just 'cause all of
them last to the end of the text, so we cannot 'check first'.
Anyway...
For aaaa: on the step of comparing aa and aaa, we found they overlapped, so
we keep s1 = aa, but make s2 = aaaa, the one next to aaa.

This might be an appropriate hack--I'll try it when I have a
minute. :)
 
J

Jan

[Note: parts of this message were removed to make it a legal post.]

[Note: parts of this message were removed to make it a legal post.]

Hi Alex,

Have the same problem with you :)

I think we should understand the 'adjacent' as 'not overlapped adjacent'
here. Each time we take two adjacent suffix, we check whether they
overlapped first - if so, make s2 point to next suffix, until we found a not
overlapped one.

Hm... but every pair of suffixes _are_ overlapped just 'cause all of
them last to the end of the text, so we cannot 'check first'.
Anyway...


Yes u'r right ... I should have described more clear: take two adjacent,
check whether the "head" of them overlapped:

overlap?(s1[0, current_longest_substr.length], s2[0,
current_longest_substr.length])

Thanks
Jan
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,763
Messages
2,569,562
Members
45,038
Latest member
OrderProperKetocapsules

Latest Threads

Top