Help: Creating condensed expressions

D

David Hirschfield

Here's the problem: Given a list of item names like:

apple1
apple2
apple3_SD
formA
formB
formC
kla_MM
kla_MB
kca_MM

which is a subset of a much larger list of items,
is there an efficient algorithm to create condensed forms that match
those items, and only those items? Such as:

apple[12]
apple3_SD
form[ABC]
kla_M[MB]
kca_MM

The condensed expression syntax only has [...] and * as operators. [...]
matches a set of individual characters, * matches any string.
I'd be satisfied with a solution that only uses the [...] syntax, since
I don't think it's possible to use * without potentially matching items
not explicitly in the given list.

I'm not sure what this condensed expression syntax is called (looks a
bit like shell name expansion syntax), and I'm not even sure there is an
efficient way to do what I'm asking. Any ideas would be appreciated.

Thanks,
-David
 
B

bearophileHUGS

This is a first try, is something like this enough for you?


data = """apple1
apple2
apple3_SD
formA
formB
formC
kla_MM
kla_MB
kca_MM""".split()

headtails = {}
for word in data:
head = word[:-1]
if head in headtails:
headtails[head].append(word[-1])
else:
headtails[head] = [word[-1]]
for head, tails in sorted(headtails.iteritems()):
if len(tails) == 1:
print head + tails[0]
else:
print head + "[%s]" % "".join(tails)


Output:
apple[12]
apple3_SD
form[ABC]
kca_MM
kla_M[MB]

It looks only the last letters. It modifies the original order (it
sorts the sequence on the root-word).
If you don't need sorted results you can remove the sorted().

Bye,
bearophile
 
B

bearophileHUGS

Nevermind, I didn't understand the problem/question... Sorry.

Bye,
bearophile
 
E

Eddie Corns

David Hirschfield said:
Here's the problem: Given a list of item names like:
apple1
apple2
apple3_SD
formA
formB
formC
kla_MM
kla_MB
kca_MM

which is a subset of a much larger list of items,
is there an efficient algorithm to create condensed forms that match
those items, and only those items? Such as:
apple[12]
apple3_SD
form[ABC]
kla_M[MB]
kca_MM

The condensed expression syntax only has [...] and * as operators. [...]
matches a set of individual characters, * matches any string.
I'd be satisfied with a solution that only uses the [...] syntax, since
I don't think it's possible to use * without potentially matching items
not explicitly in the given list.
I'm not sure what this condensed expression syntax is called (looks a
bit like shell name expansion syntax), and I'm not even sure there is an
efficient way to do what I'm asking. Any ideas would be appreciated.

If you used regular expressions instead of globs then I think what you want is
to optimize the regular expression for:

'apple1|apple2|apple3_SD|formA|formB|formC|kla_MM|kla_MB|kca_MM'

then spit the optimised finite automaton back out. Of course if you're just
searching Python might do a decent job of optimising it anyway.

Looking at:

http://en.wikipedia.org/wiki/Finite_state_machine#Optimization

they make it sound so easy!. There's also a whole bunch of tools on that
page. Maybe there's an optimiser you can use in one of them.

Failing that I guess you build a tree and try to merge nodes where they fork.
I suspect an optimal solution would be hard but if you knew your input did
have lots of redundancy a simple algorithm might work.

Or I could be talking crap as usual.

Eddie
 
M

Michael Spencer

Nevermind, I didn't understand the problem/question... Sorry.

Bye,
bearophile
Really? Your solution looks fine to me.

In any case, here's an alternative approach to the (based on the same
understanding of the problem as bearophile's, but with the additional
requirement that the input terms be sorted)
>>> from itertools import groupby
>>> from operator import itemgetter
>>> src_iter = ((i[:-1],i[-1]) for i in src.splitlines()) #src must be sorted
>>> grouped = groupby(src_iter, itemgetter(0))
>>> stemmed = ((stem, "".join(i[1] for i in values)) for stem, values in grouped)
>>> [(len(s[1])>1 and "%s[%s]" or "%s%s") % s for s in stemmed] ['apple[12]', 'apple3_SD', 'form[ABC]', 'kla_M[MB]', 'kca_MM']
>>>


Michael
 
B

Bruno Desthuilliers

David Hirschfield a écrit :
Here's the problem: Given a list of item names like:

apple1
apple2
apple3_SD
formA
formB
formC
kla_MM
kla_MB
kca_MM

which is a subset of a much larger list of items,
is there an efficient algorithm to create condensed forms that match
those items, and only those items? Such as:

apple[12]
apple3_SD
form[ABC]
kla_M[MB]
kca_MM


The condensed expression syntax only has [...] and * as operators. [...]
matches a set of individual characters, * matches any string.
I'd be satisfied with a solution that only uses the [...] syntax, since
I don't think it's possible to use * without potentially matching items
not explicitly in the given list.

I'm not sure what this condensed expression syntax is called (looks a
bit like shell name expansion syntax),

Looks like a very restricted subset of regular expressions.
and I'm not even sure there is an
efficient way to do what I'm asking. Any ideas would be appreciated.


import re

lines = """
apple1
apple2
apple3_SD
formA
formB
formC
kla_MM
kla_MB
kca_MM
""".strip().split()

patterns = [
r'apple[12]',
r'apple3_SD',
r'form[ABC]',
r'kla_M[MB]',
r'kca_MM',
]

for pat in patterns:
for line in lines:
m = re.match(pat, line)
print "%s match %s : %s" % (pat, line, m and "Yes" or 'No')


HTH
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top