Determining the common prefix for several strings

T

Todd Burch

I have several strings to analyze. The strings are the names of a list
of things that were once ordered and now are not. A typical array of
names might look like this:

item001
item004
item002
item002b
item002C
item 5
item 10
itemize this

I want to parse these names and determine that "item" is the common
prefix.

I know how to armstrong my way through it a character at a time, but I'm
thinking that Ruby probably has a very elegant way of doing this that
I'm not familiar with yet.

Thanks, Todd
 
M

Michael Glaesemann

I have several strings to analyze. The strings are the names of a
list
of things that were once ordered and now are not. A typical array of
names might look like this:

item001
item004
item002
item002b
item002C
item 5
item 10
itemize this

I want to parse these names and determine that "item" is the common
prefix.

I know how to armstrong my way through it a character at a time,
but I'm
thinking that Ruby probably has a very elegant way of doing this that
I'm not familiar with yet.

I don't know of a particularly elegant way to do this in Ruby, but my
initial thoughts are

list = <<-EOS
item001
item004
item002
item002b
item002C
item 5
item 10
itemize this
EOS
# sort by length: prefix can't be longer than shortest string.
items = list.split("\n").sort_by { |i| i.length }
# use shortest as prefix
first = items.shift
items.inject(first) do |prefix, item|
next prefix if item.match(/^#{prefix}/)
# have to go char by char
n = 0
for i in 0..(prefix.length - 1) do
break if prefix != item
n = i
end
prefix[0..n]
end

Michael Glaesemann
grzm seespotcode net
 
X

Xavier Noria

El Aug 3, 2007, a las 2:31 AM, Todd Burch escribi=F3:
I have several strings to analyze. The strings are the names of a =20
list
of things that were once ordered and now are not. A typical array of
names might look like this:

item001
item004
item002
item002b
item002C
item 5
item 10
itemize this

I want to parse these names and determine that "item" is the common
prefix.

I know how to armstrong my way through it a character at a time, =20
but I'm
thinking that Ruby probably has a very elegant way of doing this that
I'm not familiar with yet.

This is a solution that avoids going char by char:

items =3D [
'item001',
'item004',
'item002',
'item002b',
'item002C',
'item 5',
'item 10',
'itemize this'
]

min_length =3D items.map {|i| i.length}.min
cuts =3D items.map {|i| i.split(//)[0, min_length]}
cuts.transpose.each_with_index do |t, i|
if t.uniq.length !=3D 1
puts cuts.first[0, i].join('')
break
end
end

We first check which is the length of the shortest string in the =20
array, then split by chars to store those many ones, and then =20
transpose the matrix. Note that transposing is OK because we cut the =20
strings. That's longer to explain than to write it in Ruby as you =20
see. Once you get the transpose you can go row by row to detect the =20
one that has more than one distinct element. Again that's easy =20
delegating the job to uniq.

-- fxn
 
X

Xavier Noria

El Aug 3, 2007, a las 4:26 AM, Xavier Noria escribi=F3:
min_length =3D items.map {|i| i.length}.min
cuts =3D items.map {|i| i.split(//)[0, min_length]}
cuts.transpose.each_with_index do |t, i|
if t.uniq.length !=3D 1
puts cuts.first[0, i].join('')
break
end
end

Hmmm, too convoluted. I realized you don't need the transpose and =20
thus the min length and cuts if you go by columns in the loop. This =20
is the same idea only more simple:

splits =3D items.map {|i| i.split(//)}
(0...splits.length).each do |i|
if splits.map {|s| s}.uniq.length !=3D 1
puts splits.first[0, i].join('')
break
end
end

-- fxn
 
X

Xavier Noria

El Aug 3, 2007, a las 4:41 AM, Xavier Noria escribi=F3:
Hmmm, too convoluted. I realized you don't need the transpose and =20
thus the min length and cuts if you go by columns in the loop. This =20=
is the same idea only more simple:

splits =3D items.map {|i| i.split(//)}
(0...splits.length).each do |i|
if splits.map {|s| s}.uniq.length !=3D 1
puts splits.first[0, i].join('')
break
end
end


Ah, and the prefix can be extracted from the original string instead =20
of the split:

splits =3D items.map {|i| i.split(//)}
(0...splits.length).each do |i|
if splits.map {|s| s}.uniq.length !=3D 1
puts items.first[0, i]
break
end
end

-- fxn
 
X

Xavier Noria

El Aug 3, 2007, a las 4:46 AM, Xavier Noria escribi=F3:
splits =3D items.map {|i| i.split(//)}
(0...splits.length).each do |i|
if splits.map {|s| s}.uniq.length !=3D 1
puts items.first[0, i]
break
end
end


There's a bug in the upper limit of the range, it should be the =20
length of some element of the splits

splits =3D items.map {|i| i.split(//)}
(0...splits.first.length).each do |i|
if splits.map {|s| s}.uniq.length !=3D 1
puts items.first[0, i]
break
end
end

We could compute the length of the shortest string and use that finer =20=

upper limit, but that's not necessary because even in the case that =20
the shortest string is the common prefix, we would get some nil in =20
the map through columns in the next iteration, and thus at least two =20
distinct values. On the other hand there's the corner case when all =20
strings are equal, that's not handled in the above code because they =20
way to do it depends on the details.

Please forgive those many revisions, I guess I need some sleep at =20
5:22 AM.

-- fxn
 
H

Harry Kakueki

I want to parse these names and determine that "item" is the common
prefix.
If I understood you correctly, I think this will do it.

r = ["itemz14","item004","item002","item002b","item 5","itemize this"]

s = []
a ,b = r.max.split(//), r.min.split(//)
a.each_with_index{|x,y|s << y if x != b[y]}
(0...s[0]).each {|x| print a[x]}

Harry
 
T

Todd Burch

Holy Cow! You guys are geniuses! Y'all are using these methods like
I've never seen or even thought of using them before.

Problem is solved - script is working - THANKS!!!

Todd
 
P

Peña, Botp

T24gQmVoYWxmIE9mIFRvZGQgQnVyY2g6DQojIEhvbHkgQ293ISAgIFlvdSBndXlzIGFyZSBnZW5p
dXNlcyEgICBZJ2FsbCBhcmUgdXNpbmcgdGhlc2UgDQojIG1ldGhvZHMgbGlrZS4gSSd2ZSBuZXZl
ciBzZWVuIG9yIGV2ZW4gdGhvdWdodCBvZiB1c2luZyB0aGVtIGJlZm9yZS4NCg0KZG9u4oCZdCBj
b3VudCBtZSBwbHMuIGknbSBub3QgaG9seSBub3IgZ2VuaXVzLCAocG9zc2libHkgYSBjb3cgOikg
YnV0IGp1c3QgYSBudWJ5IGxlYXJuaW5nIHJ1YnkgZnIgdGhlIG1hcnZlbG91cyBydWJ5IGNvbW11
bml0eSAodGhhbmtzIHRvIG1hdHosIG5vYnUsIGd1eWRlY291eCwgbWF1cmljaW8sIGRibGFjaywg
Li4ubWwtbGlzdC5zaXplICkuDQoNCmZmIGlzIGFub3RoZXIgZXhhbXBsZSAob2YgbXkgbnViaW5l
c3MpLiBpdCBzaG93cyBuaWZ0aW5lc3Mgb2YgYmFuZyBtZXRob2RzLCBjaGFpbnMsIGFuZCB0aGUg
dW5kZXJ1dGlsaXplZCBhbGw/LA0KDQpDQzpcZmFtaWx5XHJ1Ynk+Y2F0IHRlc3QucmINCmxpc3Qg
PSAldyhpdGVtb25lIGl0ZW10d28gaXRlbTIyMiBpdGVtMDAzMyBpdGVtMTEgaXRleHRoaXMgaXRl
bXRoaXNhbmR0aGF0IGl0ZW10DQpob3NlKQ0KDQptaW4gPSBsaXN0LnNvcnQhe3xhLGJ8IGEuc2l6
ZSA8PT4gYi5zaXplfS5maXJzdA0KYnJlYWsgaWYgbGlzdC5hbGw/e3xhfCBhID1+IC9eI3ttaW59
L30gd2hpbGUgbWluLmNob3AhDQoNCnB1dHMgbWluDQpDOlxmYW1pbHlccnVieT5ydWJ5IHRlc3Qu
cmINCml0ZQ0KDQpDOlxmYW1pbHlccnVieT4NCg0KaSBsdXYgbXkgZmFtaWx5LiBpIGx1diBydWJ5
IDopDQoNCmtpbmQgcmVnYXJkcyAtYm90cA0KDQo=
 
D

Daniel DeLorme

El Aug 3, 2007, a las 2:31 AM, Todd Burch escribió:
Haha! Regexp magic!

list = <<-EOS
item001
item004
item002
item002b
item002C
item 5
item 10
itemize this
EOS

/\A(.*).*(\n\1.*)*\Z/.match(list)[1]
=> "item"

:-D
Daniel
 
X

Xavier Noria

El Aug 3, 2007, a las 6:23 AM, Harry Kakueki escribi=F3:
If I understood you correctly, I think this will do it.

r =3D ["itemz14","item004","item002","item002b","item 5","itemize = this"]

s =3D []
a ,b =3D r.max.split(//), r.min.split(//)
a.each_with_index{|x,y|s << y if x !=3D b[y]}
(0...s[0]).each {|x| print a[x]}

Good, so lexicographic order indeed helps. I would prefer to order =20
the array only once though, and construct the prefix right ahead as a =20=

string so that the code reads more direct:

prefix =3D ''
min, max =3D items.sort.values_at(0, -1)
min.split(//).each_with_index do |c, i|
break if c !=3D max[i, 1]
prefix << c
end
puts prefix

My approach with splits removed and handling of the corner case where =20=

all items are equal is:

prefix =3D items.first
(0...items.first.length).each do |i|
if items.map {|item| item[i, 1]}.uniq.length !=3D 1
prefix =3D prefix[0, i]
break
end
end
puts prefix

That solution does not order the array, but explicitly loops over all =20=

the columns in the prefix plus one via map, and uniqs them. The =20
approach based on sort seems a bit better to me.

-- fxn
 
R

Robert Klemme

------=_Part_88475_10524956.1186141149128
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

2007/8/3 said:
El Aug 3, 2007, a las 2:31 AM, Todd Burch escribi=F3:

Haha! Regexp magic!

list =3D <<-EOS
item001
item004
item002
item002b
item002C
item 5
item 10
itemize this
EOS

/\A(.*).*(\n\1.*)*\Z/.match(list)[1]
=3D> "item"

Amazing! I wonder how this performs for large input sets (because of
the two ".*" and the "()*"). Could generate a lot of backtracking
(apart from the string length itself).

This is probably better in terms of individual comparisons. The idea
is to reduce the prefix length whenever a mismatch is found
(attached). That solution also avoids sorting and multiple traversal
of the input in general. So it's suited for stream based processing
(i.e. for reading large files in one go).

Kind regards

robert

------=_Part_88475_10524956.1186141149128
Content-Type: application/x-ruby; name=pref.rb
Content-Transfer-Encoding: base64
X-Attachment-Id: f_f4wlpdak
Content-Disposition: attachment; filename="pref.rb"

bGlzdCA9IDw8LUVPUy50b19hCml0ZW0wMDEKaXRlbTAwMQppdGVtMDA0Cml0ZW0wMDIKaXRlbTAw
MmIKaXRlbTAwMkMKaXRlbSA1Cml0ZW0gMTAKaXRlbWl6ZSB0aGlzCkVPUwoKcHJlZiA9IGxpc3Qu
c2hpZnQKbGlzdC5lYWNoIGRvIHxzfAogIGkgPSAwCiAgd2hpbGUgcHJlZltpXSA9PSBzW2ldICYm
IGkgPCBwcmVmLnNpemUKICAgIGkgKz0gMQogIGVuZAogIHByZWYuc2xpY2UhIGkuLi0xCmVuZApw
IHByZWYK
------=_Part_88475_10524956.1186141149128--
 
D

Daniel DeLorme

Robert said:
2007/8/3 said:
/\A(.*).*(\n\1.*)*\Z/.match(list)[1]
=> "item"

Amazing! I wonder how this performs for large input sets (because of
the two ".*" and the "()*"). Could generate a lot of backtracking
(apart from the string length itself).

I think it's ok because the .* pattern is always anchored to line
endings. In this context, .*\n is the equivalent of [^\n]*\n which
leaves no room for backtracking.

Daniel
 
R

Robert Klemme

2007/8/3 said:
Robert said:
2007/8/3 said:
/\A(.*).*(\n\1.*)*\Z/.match(list)[1]
=> "item"

Amazing! I wonder how this performs for large input sets (because of
the two ".*" and the "()*"). Could generate a lot of backtracking
(apart from the string length itself).

I think it's ok because the .* pattern is always anchored to line
endings. In this context, .*\n is the equivalent of [^\n]*\n which
leaves no room for backtracking.

I believe you are wrong. Look at "\A(.*).*" - there are multiple ways
this can match (including matching \n inside any of the .* parts).
Anchoring does not help much here IMHO.

Kind regards

robert
 
D

Daniel DeLorme

Robert said:
I believe you are wrong. Look at "\A(.*).*" - there are multiple ways
this can match (including matching \n inside any of the .* parts).
Anchoring does not help much here IMHO.

Yes, there is backtracking for the first part (there has to be, the
whole point being to find the longest matching string), but \n CANNOT
match inside any of the .* parts unless you specify the m modifier.

Daniel
 
L

Lloyd Linklater

I think that this thread explained to me just why so many people love
Ruby so much. There are wildly different approaches to solving this
that all work. People are very different in how they think and how they
solve problems. Ruby can be seen here as SO flexible that, no matter
how you think things through, Ruby still covers all the bases making it
a comfortable shoe for everyone. This may be a unique example of one
size actually fitting all. That is just amazing and borders on magic.

While I am quite good in a few languages, I cannot wait to get good at
Ruby.
 
X

Xavier Noria

El Aug 3, 2007, a las 1:35 PM, Xavier Noria escribi=F3:
prefix =3D ''
min, max =3D items.sort.values_at(0, -1)
min.split(//).each_with_index do |c, i|
break if c !=3D max[i, 1]
prefix << c
end
puts prefix

Yet another iteration: the prefix can be extracted with a regexp, =20
it's shorter although perhaps more obscure:

min, max =3D items.sort.values_at(0, -1)
puts min+max =3D~ /(.).{#{min.length-1}}(?!\1)/m ? $` : min

I don't like the ternary and the fact that we look for the first =20
mismatch instead of extracting directly the prefix in one shot.

-- fxn
 
R

Robert Klemme

------=_Part_89784_24707089.1186148990033
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

2007/8/3 said:
Yes, there is backtracking for the first part (there has to be, the
whole point being to find the longest matching string), but \n CANNOT
match inside any of the .* parts unless you specify the m modifier.

Correct. I always confuse Ruby's /m with other language's variants.
Thanks for the reeducation!

Still it would be interesting to do benchmarks with larger numbers of
words to see what happens.... I have attached a test. It seems the RX
approach is faster very long.

Kind regards

robert

------=_Part_89784_24707089.1186148990033
Content-Type: application/x-ruby; name="pref-rb.rb"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="pref-rb.rb"
X-Attachment-Id: f_f4wqh66b

cmVxdWlyZSAnYmVuY2htYXJrJwoKTElTVCA9IDw8LUVPUy50b19hLmZyZWV6ZQppdGVtMDAxCml0
ZW0wMDEKaXRlbTAwNAppdGVtMDAyCml0ZW0wMDJiCml0ZW0wMDJDCml0ZW0gNQppdGVtIDEwCml0
ZW1pemUgdGhpcwpFT1MKCkJlbmNobWFyay5ibSgxNSkgZG8gfHh8CiAgaXQgPSAyCiAgbG9vcCBk
bwogICAgbCA9IExJU1QuZHVwCiAgICBpdC50aW1lcyB7bC5jb25jYXQgbH0KICAgIGl0ICs9IDEK
CiAgICBsMSA9IGwuam9pbgogICAgbDIgPSBsLmR1cAoKICAgIHgucmVwb3J0KCJyeCAgICU1ZCIg
JSBsLnNpemUpIGRvCiAgICAgIHByZWYgPSAvXEEoLiopLiooXG5cMS4qKSpcWi8ubWF0Y2gobDEp
WzFdCiAgICAgIHJhaXNlICJlcnJvciA8I3twcmVmfT4iIHVubGVzcyAiaXRlbSIgPT0gcHJlZgog
ICAgZW5kCiAgICAKICAgIHgucmVwb3J0KCJpdGVyICU1ZCIgJSBsLnNpemUpIGRvCiAgICAgIHBy
ZWYgPSBsMi5zaGlmdAogICAgICBsMi5lYWNoIGRvIHxzfAogICAgICAgIGkgPSAwCiAgICAgICAg
d2hpbGUgcHJlZltpXSA9PSBzW2ldICYmIGkgPCBwcmVmLnNpemUKICAgICAgICAgIGkgKz0gMQog
ICAgICAgIGVuZAogICAgICAgIHByZWYuc2xpY2UhIGkuLi0xCiAgICAgIGVuZAogICAgICByYWlz
ZSAiZXJyb3IgI3twcmVmfSIgdW5sZXNzICJpdGVtIiA9PSBwcmVmCiAgICBlbmQKICBlbmQKZW5k
Cg==
------=_Part_89784_24707089.1186148990033--
 
X

Xavier Noria

El Aug 3, 2007, a las 3:39 PM, Xavier Noria escribi=F3:
El Aug 3, 2007, a las 1:35 PM, Xavier Noria escribi=F3:
prefix =3D ''
min, max =3D items.sort.values_at(0, -1)
min.split(//).each_with_index do |c, i|
break if c !=3D max[i, 1]
prefix << c
end
puts prefix

Yet another iteration: the prefix can be extracted with a regexp, =20
it's shorter although perhaps more obscure:

min, max =3D items.sort.values_at(0, -1)
puts min+max =3D~ /(.).{#{min.length-1}}(?!\1)/m ? $` : min

I don't like the ternary and the fact that we look for the first =20
mismatch instead of extracting directly the prefix in one shot.

This is a direct way:

min, max =3D items.sort.values_at(0, -1)
puts (min+max).match(/\A(.*).*(?=3D.{#{max.length}}\z)\1/m)[1]

Of course if you can rely on some character that does no belong to =20
the items that's easier, it just takes a simplification of Daniel's =20
in the case of "\n":

min, max =3D items.sort.values_at(0, -1)
puts "#{min}\n#{max}".match(/^(.*).*\n\1/)[1]

Albeit those positive approaches are direct they backtrack quite a =20
lot, perhaps the fixed-length look-ahead with \z in the latter may =20
get some optimization, I don't knw. On the contrary the negative =20
approach, the one that looks for a mismatch, does not backtrack at =20
all, in that sense that's the direct one.

Anyway, I doubt I'd use a regexp approach in production code.

-- fxn
 
S

Simon Kröger

Peña said:
[...]
ff is another example (of my nubiness). it shows niftiness of bang methods, chains, and the underutilized all?,

don't forget the any? :)

puts r.first[0...(0..r.first.size).find{|i| r.any?{|s|r.first[0..i]!=s[0..i]}}]

cheers

Simon
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,755
Messages
2,569,537
Members
45,022
Latest member
MaybelleMa

Latest Threads

Top