Aligning a set of word substrings to a sentence

Steven Bethard

I've got a list of word substrings (the "tokens") which I need to align
to a string of text (the "sentence"). The sentence is basically the
concatenation of the token list, with spaces sometimes inserted between
tokens. I need to determine the start and end offsets of each token in
the sentence. For example::

py> tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
py> text = '''\
.... She's gonna write
.... a book?'''
py> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]

Here's my current definition of the offsets function::

py> def offsets(tokens, text):
....     start = 0
....     for token in tokens:
....         while text[start].isspace():
....             start += 1
....         text_token = text[start:start+len(token)]
....         assert text_token == token, (text_token, token)
....         yield start, start + len(token)
....         start += len(token)
....

I feel like there should be a simpler solution (maybe with the re
module?) but I can't figure one out. Any suggestions?

STeVe
 
Fredrik Lundh

Steven said:
I feel like there should be a simpler solution (maybe with the re
module?) but I can't figure one out. Any suggestions?

using the finditer pattern I just posted in another thread:

tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = '''\
She's gonna write
a book?'''

import re

tokens.sort() # lexical order
tokens.reverse() # look for longest match first
pattern = "|".join(map(re.escape, tokens))
pattern = re.compile(pattern)

I get

print [m.span() for m in pattern.finditer(text)]
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]

which seems to match your version pretty well.
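An aside on the sort/reverse step above: Python's "|" alternation tries
alternatives left to right rather than picking the longest match, so the
reverse-lexical ordering guarantees a token never shadows a longer token
that it prefixes. A tiny demonstration (the token 'gonna' is made up here
for illustration; it isn't in the thread's list):

import re

re.findall("gon|gonna", "gonna")   # ['gon']   -- the prefix wins, wrong split
re.findall("gonna|gon", "gonna")   # ['gonna'] -- longest-first is correct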

hope this helps!

</F>
 
Steven Bethard

Fredrik said:
using the finditer pattern I just posted in another thread:
[snip]
print [m.span() for m in pattern.finditer(text)]
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]

which seems to match your version pretty well.

That's what I was looking for. Thanks!

STeVe
 
Paul McGuire

Steven Bethard said:
I've got a list of word substrings (the "tokens") which I need to align
to a string of text (the "sentence"). [snip]
py> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]

Hey, I get the same answer with this:

===================
from pyparsing import oneOf

tokens = ['She', "'s", 'gon', 'na', 'write', 'a', 'book', '?']
text = '''\
She's gonna write
a book?'''

tokenlist = oneOf( " ".join(tokens) )
offsets = [(start,end) for token,start,end in tokenlist.scanString(text) ]

print offsets
===================
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]


Of course, pyparsing may be a bit heavyweight to drag into a simple function
like this, and certainly not nearly as fast as a regexp. But it was such a
nice way to show how scanString works.

Pyparsing's "oneOf" helper function takes care of the same longest match
issues that Fredrik Lundh handles using sort, reverse, etc. Just so long as
none of the tokens has an embedded space character.
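If tokens with embedded spaces ever did come up, one workaround (a sketch,
not from the thread, and not tied to any particular pyparsing version) is to
build the alternation by hand, longest token first:

from pyparsing import Literal, MatchFirst

# oneOf splits its argument on whitespace, so space-containing tokens
# need an explicit alternation, ordered longest-first:
tokenlist = MatchFirst([Literal(t) for t in sorted(tokens, key=len, reverse=True)])
offsets = [(start, end) for token, start, end in tokenlist.scanString(text)]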

-- Paul
 
Steven Bethard

Paul said:
[snip]
tokenlist = oneOf( " ".join(tokens) )
offsets = [(start,end) for token,start,end in tokenlist.scanString(text) ]
[snip]

Now that's a pretty solution. Three cheers for pyparsing! :)

STeVe
 
Michael Spencer

Steven said:
I've got a list of word substrings (the "tokens") which I need to align
to a string of text (the "sentence"). [snip]

I feel like there should be a simpler solution (maybe with the re
module?) but I can't figure one out. Any suggestions?

Hi Steve:

Any reason you can't simply use str.find in your offsets function?

>>> def offsets(tokens, text):
...     ptr = 0
...     for token in tokens:
...         fpos = text.find(token, ptr)
...         if fpos != -1:
...             end = fpos + len(token)
...             yield (fpos, end)
...             ptr = end
...
>>> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>>>
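One caveat with a bare str.find (a sketch, not from the thread): if a token
is missing, or if something other than whitespace sits between two tokens,
the version above skips over it silently. Restoring the original post's
assert makes such misalignments fail loudly:

def offsets(tokens, text):
    ptr = 0
    for token in tokens:
        fpos = text.find(token, ptr)
        # fail loudly on a missing token or a non-whitespace gap,
        # like the assert in the original offsets function did
        assert fpos != -1, token
        assert not text[ptr:fpos].strip(), text[ptr:fpos]
        ptr = fpos + len(token)
        yield fpos, ptr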

and then, for an entry in the wacky category, a difflib solution:

>>> def offsets(tokens, text):
...     from difflib import SequenceMatcher
...     s = SequenceMatcher(None, text, "\t".join(tokens))
...     for start, _, length in s.get_matching_blocks():
...         if length:
...             yield start, start + length
...
>>> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]
>>>

cheers
Michael
 
Fredrik Lundh

Steven said:
Fredrik said:
using the finditer pattern I just posted in another thread:
[snip]
That's what I was looking for. Thanks!

except that I misread your problem statement; the RE solution above allows
the tokens to be specified in arbitrary order. if they're always ordered,
you can replace the code with something like:

# match tokens plus optional whitespace between each token
pattern = "\s*".join("(" + re.escape(token) + ")" for token in tokens)
m = re.match(pattern, text)
result = (m.span(i+1) for i in range(len(tokens)))

which is 6-7 times faster than the previous solution, on my machine.
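Spelled out as a complete replacement for the original offsets function (a
sketch that assumes, as the snippet above does, that the tokens really do
line up with the text):

import re

def offsets(tokens, text):
    # one capturing group per token, optional whitespace in between;
    # a single match then yields every token's span at once
    pattern = r"\s*".join("(" + re.escape(token) + ")" for token in tokens)
    m = re.match(pattern, text)
    return [m.span(i + 1) for i in range(len(tokens))]

# offsets(tokens, text)
# -> [(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]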

</F>
 
Steven Bethard

Fredrik said:
[snip]
# match tokens plus optional whitespace between each token
pattern = "\s*".join("(" + re.escape(token) + ")" for token in tokens)
m = re.match(pattern, text)
result = (m.span(i+1) for i in range(len(tokens)))

which is 6-7 times faster than the previous solution, on my machine.

Ahh yes, that's faster for me too. Thanks again!

STeVe
 
Steven Bethard

Michael said:
Steven said:
I've got a list of word substrings (the "tokens") which I need to
align to a string of text (the "sentence"). [snip]

and then, for an entry in the wacky category, a difflib solution:

>>> def offsets(tokens, text):
...     from difflib import SequenceMatcher
...     s = SequenceMatcher(None, text, "\t".join(tokens))
...     for start, _, length in s.get_matching_blocks():
...         if length:
...             yield start, start + length
...
>>> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25)]

That's cool, I've never seen that before. If you pass in str.isspace,
you can even drop the "if length:" line::

py> from difflib import SequenceMatcher
py> def offsets(tokens, text):
....     s = SequenceMatcher(str.isspace, text, '\t'.join(tokens))
....     for start, _, length in s.get_matching_blocks():
....         yield start, start + length
....
py> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25), (25, 25)]

I think I'm going to have to take a closer look at
difflib.SequenceMatcher; I have to do things similar to this pretty often...

STeVe
 
Steven Bethard

Steven said:
Michael said:
[snip]
That's cool, I've never seen that before. If you pass in str.isspace,
you can even drop the "if length:" line::

py> def offsets(tokens, text):
....     s = SequenceMatcher(str.isspace, text, '\t'.join(tokens))
....     for start, _, length in s.get_matching_blocks():
....         yield start, start + length
....
py> list(offsets(tokens, text))
[(0, 3), (3, 5), (6, 9), (9, 11), (12, 17), (18, 19), (20, 24), (24, 25), (25, 25)]

Sorry, that should have been::

py> list(offsets(tokens, text))[:-1]

since the last item is always the zero-length one. Which means you
don't really need str.isspace either.
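Putting the thread's pieces together, the difflib variant then comes out as
(a sketch; get_matching_blocks always ends with a zero-length sentinel
block, which the [:-1] drops):

from difflib import SequenceMatcher

def offsets(tokens, text):
    # match the text against the tab-joined tokens; for input like the
    # thread's example, each matching block is one token, plus the
    # trailing zero-length sentinel that get_matching_blocks appends
    s = SequenceMatcher(None, text, "\t".join(tokens))
    return [(start, start + length)
            for start, _, length in s.get_matching_blocks()][:-1]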

STeVe
 
