best way to align words?

Robert R. · Nov 30, 2006

Hello,

i would like to write a piece of code to help me to align some sequence
of words and suggest me the ordered common subwords of them

s0 = "this is an example of a thing i would like to have".split()
s1 = "another example of something else i would like to have".split()
s2 = 'and this is another " example " but of something ; now i would
still like to have'.split()
....
alist = (s0, s1, s2)

result should be : ('example', 'of', 'i', 'would', 'like', 'to', 'have'

but i do not know how should i start, may be have you a helpful
suggestion?
a trouble i have if when having many different strings my results tend
to be nothing while i still would like to have one of the, or maybe,
all the best matches.

best.

Thomas Ploch · Nov 30, 2006

Robert said:
Hello,

i would like to write a piece of code to help me to align some sequence
of words and suggest me the ordered common subwords of them

s0 = "this is an example of a thing i would like to have".split()
s1 = "another example of something else i would like to have".split()
s2 = 'and this is another " example " but of something ; now i would
still like to have'.split()
...
alist = (s0, s1, s2)

result should be : ('example', 'of', 'i', 'would', 'like', 'to', 'have'

but i do not know how should i start, may be have you a helpful
suggestion?
a trouble i have if when having many different strings my results tend
to be nothing while i still would like to have one of the, or maybe,
all the best matches.

best.

As far as I can see, you want to have the words, that all three lists
have in common, right?

s0 = "this is an example of a thing i would like to have".split()
s1 = "another example of something else i would like to have".split()
s2 = 'and this is another " example " but of something ; now i would
still like to have'.split()

def findCommons(s0, s1, s2):
res = []
for word in s0:
if word in s1 and word in s2:
res.append(word)
return res
['example', 'of', 'i', 'would', 'like', 'to', 'have']

Mitja Trampus · Nov 30, 2006

Robert said:
Hello,

i would like to write a piece of code to help me to align some sequence
of words and suggest me the ordered common subwords of them

a trouble i have if when having many different strings my results tend
to be nothing while i still would like to have one of the, or maybe,
all the best matches.

"align"?
Anyway, for finding the commonest words, you'll be best off
counting how many times each word appears:

lst = ["foo bar baz", "qux foo foo kaka", "one foo and kaka
times qux"]

for line in lst:
for word in line.split():
count[word] = count.get(word,0) + 1

Now you go for the ones with the highest count:

for (word, n) in sorted(d.items(), key = lambda x: x[1],
reverse = True):
print word, 'appears', n, 'times'

Untested. If you want to count the number of lines a word
appears in (as opposed to the number of times it appears at
all), add an extra condition before count[word] = ...

Oleg Batrashev · Nov 30, 2006

i would like to write a piece of code to help me to align some sequence
of words and suggest me the ordered common subwords of them

Im not sure what you want, but in case you are guy who knows how
quicksort and Djikstra algorithms work

and wants to find out more.

There are many algorithms out there, discovered on "Text algorithms"
univesity course. The first one does not directly solve your problem -
"edit distance" (Levenshtein distance)
http://en.wikipedia.org/wiki/Levenshtein_distance
I mention it here only because it is simple and shows basic idea of
Dynamic Programming
http://en.wikipedia.org/wiki/Dynamic_programming

If you scroll down you'll see "Longest common subsequence problem" with
implementation in Python for 2 sequences. If you dont understand how it
works just look into "edit distance" idea and see it is exactly the
same algorithm with changed rules.

Oleg

bearophileHUGS · Nov 30, 2006

Robert R.:

i would like to write a piece of code to help me to align some sequence
of words and suggest me the ordered common subwords of them [...]
a trouble i have if when having many different strings my results tend
to be nothing while i still would like to have one of the, or maybe,
all the best matches.

This is my first solution try, surely there are faster, shorter, better
solutions...

from collections import defaultdict
from itertools import chain
from graph import Graph
# http://sourceforge.net/projects/pynetwork/

def commonOrdered(*strings):
lists = [[w for w in string.lower().split() if w.isalpha()] for
string in strings]

freqs = defaultdict(int)
for w in chain(*lists):
freqs[w] += 1

g = Graph()
for words in lists:
g.addPath(words)

len_strings = len(strings)
return [w for w in g.toposort() if freqs[w]==len_strings]

s0 = "this is an example of a thing i would like to have"
s1 = "another example of something else i would like to have"
s2 = 'and this is another " example " but of something ; now i would
still like to have'

print commonOrdered(s0, s1, s2)

It creates a graph with the paths of words, then sorts the graph
topologically, then takes only the words of the sorting that are
present in all the original strings.
With a bit of work the code can be used if it contains words like
"example" instead of " example ".
An xtoposort method too can be added to the Graph class...

Bye,
bearophile

bearophileHUGS · Nov 30, 2006

This is my first solution try, surely there are faster, shorter, better

solutions...
It creates a graph with the paths of words, then sorts the graph
topologically,

Beside possible inefficiencies, this "solution" breaks if words aren't
in the correct order, the topological sort can't work...
I'll have to think about better solutions, if possible.

Sorry,
bye,
bearophile

Noah Rawlins · Nov 30, 2006

Robert said:
Hello,

i would like to write a piece of code to help me to align some sequence
of words and suggest me the ordered common subwords of them

s0 = "this is an example of a thing i would like to have".split()
s1 = "another example of something else i would like to have".split()
s2 = 'and this is another " example " but of something ; now i would
still like to have'.split()
...
alist = (s0, s1, s2)

result should be : ('example', 'of', 'i', 'would', 'like', 'to', 'have'

but i do not know how should i start, may be have you a helpful
suggestion?
a trouble i have if when having many different strings my results tend
to be nothing while i still would like to have one of the, or maybe,
all the best matches.

best.

Your requirements are a little vague... how are these three strings handled?

s1 = "hello there dudes"
s2 = "dudes hello there"
s3 = "there dudes hello"

they all share the 3 words, but what order do you want them back?

here is a simplistic approach using sets that results in a list of words
that are in all strings ordered arbitrarily by their order in the first
string ( it also doesn't worry about matches (or lack of) due to
punctuation and case and crap like that)

>>> strList = []
>>> strList.append('this is an example of a thing i would like to have')
>>> strList.append('another example of something else i would like to have')
>>> strList.append('and this is another " example " but of something ; now i would still like to have')
>>> [word for word in strList[0].split() if word in reduce(lambda x, y:

Click to expand...

Click to expand...

x.intersection(y), [set(str.split()) for str in strList])]
['example', 'of', 'i', 'would', 'like', 'to', 'have']

but you still have issues with mutiple matches and how they are handled
etc...

noah

Noah Rawlins · Nov 30, 2006

Noah said:
strList = []
strList.append('this is an example of a thing i would like to have')
strList.append('another example of something else i would like to have')
strList.append('and this is another " example " but of something ; now i would still like to have')
[word for word in strList[0].split() if word in reduce(lambda x, y:

Click to expand...

Click to expand...

x.intersection(y), [set(str.split()) for str in strList])]
['example', 'of', 'i', 'would', 'like', 'to', 'have']

I think that ends up doing the set reduction over and over for every
word in the first string, so you probably want to move that outside the
list comprehension

noah

Robert R. · Dec 2, 2006

Hello,

thanks for all your replies, i'm now looking to dynamic programming...

sorry for forgetting to say that i wanted the words to be ordered, thus
:

s1 = "hello there dudes"
s2 = "dudes hello there"
s3 = "there dudes hello"

will not return anything while sharing all three words.

Bearophile your solution with graph looks interesting although i still
do not understand how it works, but yes there is definitively something
with drawing path around words.

i have tried SequenceMatcher from difflib after using combinations of
all sentences as i need to process much more than the 3 of my first
example.

best.

Oleg Batrashev · Dec 2, 2006

thanks for all your replies, i'm now looking to dynamic programming...
Id better warn you before you go further.
"Notice that LCS is often defined to be finding all common
subsequences of a maximum length. This problem inherently has higher
complexity, as the number of such subsequences is exponential in the
worst case"

This means that if you have 10 sentences with 5 words in each there is
5^10 space and time complexity. Definitelly, there are better
algorithms from dynamic programming, but you should review your needs:
how many sentences, words you have.

There can be easier way than dynamic programming.

Oleg

Robert R. · Dec 3, 2006

Oleg Batrashev a écrit :

This means that if you have 10 sentences with 5 words in each there is
5^10 space and time complexity. Definitelly, there are better
algorithms from dynamic programming, but you should review your needs:
how many sentences, words you have.

it can be few to many, actually it depends of the words i'm looking for.

What is the best way of going about recreating the setTimeout() function?	0	Sep 2, 2022
replace only full words	8	Sep 28, 2013
Optimal way to make a table for large lists	2	Jul 7, 2022
C++ Programming Searching for Good Way to Learn	2	Mar 2, 2023
Best way to do this? List loop (matrix?) iteration	5	Jan 8, 2013
Best way to disconnect from ldap?	12	Mar 21, 2012
Best way to find starting directory	0	Mar 19, 2013
What's the best way to extract 2 values from a CSV file from each row systematically?	6	Sep 23, 2013

best way to align words?

Robert R.

Thomas Ploch

Mitja Trampus

Oleg Batrashev

bearophileHUGS

bearophileHUGS

Noah Rawlins

Noah Rawlins

Robert R.

Oleg Batrashev

Robert R.

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads