aligning SGML to text

Steven Bethard · Jun 18, 2006

I have some plain text data and some SGML markup for that text that I
need to align. (The SGML doesn't maintain the original whitespace, so I
have to do some alignment; I can't just calculate the indices directly.)
For example, some of my text looks like:

TNF binding induces release of AIP1 (DAB2IP) from TNFR1, resulting in
cytoplasmic translocation and concomitant formation of an intracellular
signaling complex comprised of TRADD, RIP1, TRAF2, and AIPl.

And the corresponding SGML looks like:

<PROTEIN> TNF </PROTEIN> binding induces release of <PROTEIN> AIP1
</PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from <PROTEIN> TNFR1
</PROTEIN> , resulting in cytoplasmic translocation and concomitant
formation of an <PROTEIN> intracellular signaling complex </PROTEIN>
comprised of <PROTEIN> TRADD </PROTEIN> , <PROTEIN> RIP1 </PROTEIN> ,
<PROTEIN> TRAF2 </PROTEIN> , and AIPl .

Note that the SGML inserts spaces not only within the SGML elements, but
also around punctuation.
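
To make the mismatch concrete, here is a minimal sketch that is not part of the original post. The helper name `squeeze` is made up for illustration, and the sketch assumes the markup has no attributes or entities. It checks that the text and the tag-stripped SGML agree once every whitespace character is removed, even though their whitespace-separated tokens differ:

```python
import re

text = "TNF binding induces release of AIP1 (DAB2IP) from TNFR1, resulting"
sgml = ("<PROTEIN> TNF </PROTEIN> binding induces release of <PROTEIN> AIP1 "
        "</PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from <PROTEIN> TNFR1 "
        "</PROTEIN> , resulting")

def strip_tags(s):
    """Remove anything that looks like <TAG> or </TAG>."""
    return re.sub(r"<[^>]+>", "", s)

def squeeze(s):
    """Drop tags, then drop every whitespace character."""
    return re.sub(r"\s+", "", strip_tags(s))

# Same characters once whitespace is ignored ...
print(squeeze(text) == squeeze(sgml))
# ... but different tokens, because '(DAB2IP)' became '(', 'DAB2IP', ')'.
print(strip_tags(sgml).split() == text.split())
```

The first comparison is the property any alignment has to rely on; the second shows why a plain word-by-word comparison of the two strings is not enough.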

I need to determine the indices in the original text that each SGML
element corresponds to. Here's some working code to do this, based on a
suggestion for a related problem by Fredrik Lundh[1]::

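    import re
    # Assumption: the post does not show its ElementTree import; the
    # standard-library module is used here.
    import xml.etree.ElementTree as etree

    def align(text, sgml):
        # escape ampersands so the SGML fragment parses as XML
        sgml = sgml.replace('&', '&amp;')
        tree = etree.fromstring('<xml>%s</xml>' % sgml)
        # collect every word and the word range covered by each element
        words = []
        if tree.text is not None:
            words.extend(tree.text.split())
        word_indices = []
        for elem in tree:
            elem_words = elem.text.split()
            start = len(words)
            end = start + len(elem_words)
            word_indices.append((start, end, elem.tag))
            words.extend(elem_words)
            if elem.tail is not None:
                words.extend(elem.tail.split())
        # one capturing group per word, separated by optional whitespace
        expr = r'\s*'.join('(%s)' % re.escape(word) for word in words)
        match = re.match(expr, text)
        assert match is not None
        for word_start, word_end, label in word_indices:
            start = match.start(word_start + 1)
            end = match.end(word_end)
            yield label, start, end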

    >>> text = '''TNF binding induces release of AIP1 (DAB2IP) from
    ... TNFR1, resulting in cytoplasmic translocation and concomitant
    ... formation of an intracellular signaling complex comprised of TRADD,
    ... RIP1, TRAF2, and AIPl.'''
    >>> sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of
    ... <PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from
    ... <PROTEIN> TNFR1 </PROTEIN> , resulting in cytoplasmic translocation
    ... and concomitant formation of an <PROTEIN> intracellular signaling
    ... complex </PROTEIN> comprised of <PROTEIN> TRADD </PROTEIN> ,
    ... <PROTEIN> RIP1 </PROTEIN> , <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
    ... '''
    >>> list(align(text, sgml))
    [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 37, 43),
    ('PROTEIN', 50, 55), ('PROTEIN', 128, 159), ('PROTEIN', 173, 178),
    ('PROTEIN', 180, 184), ('PROTEIN', 186, 191)]

The problem is, this doesn't work when my text is long (which it is)
because the pattern needs one group per word and regular expressions
are limited to 100 groups. I get an error like::

    Traceback (most recent call last):
      ...
    AssertionError: sorry, but this version only supports 100 named groups
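
One way to stay with `re` and still avoid a single pattern with one group per word is to match the words one at a time, resuming each match where the previous one ended. The following is only a sketch under that assumption, not code from the thread; `word_spans` is a hypothetical helper name, and it takes the already-split SGML words as input:

```python
import re

def word_spans(text, words):
    """Return the (start, end) character span of each word in text.

    Words must occur in text in the given order, separated only by
    optional whitespace.
    """
    spans = []
    pos = 0
    for word in words:
        # One small pattern per word, so the group count never grows.
        pattern = re.compile(r"\s*(%s)" % re.escape(word))
        match = pattern.match(text, pos)
        if match is None:
            raise ValueError("cannot align %r at offset %d" % (word, pos))
        spans.append(match.span(1))
        pos = match.end()
    return spans

text = "TNF binding induces release of AIP1 (DAB2IP) from TNFR1, resulting"
words = ("TNF binding induces release of AIP1 ( DAB2IP ) "
         "from TNFR1 , resulting").split()
spans = word_spans(text, words)
print(spans[0], spans[5], spans[7], spans[10])
```

The word-to-element bookkeeping from `align` above would stay the same; only the step that turns word positions into character offsets changes.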

I also played around with difflib.SequenceMatcher for a while, but
couldn't get a solution based on that working. Any suggestions?

[1] http://mail.python.org/pipermail/python-list/2005-December/313388.html

Thanks,

STeVe

Gerard Flanagan · Jun 18, 2006

Steven said:
The problem is, this doesn't work when my text is long (which it is)
because regular expressions are limited to 100 groups. I get an error
like::

[...]

Steve

This is probably an abuse of itertools...

---8<---
text = '''TNF binding induces release of AIP1 (DAB2IP) from
TNFR1, resulting in cytoplasmic translocation and concomitant
formation of an intracellular signaling complex comprised of TRADD,
RIP1, TRAF2, and AIPl.'''

sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of
<PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from
<PROTEIN> TNFR1 </PROTEIN> , resulting in cytoplasmic translocation
and concomitant formation of an <PROTEIN> intracellular signaling
complex </PROTEIN> comprised of <PROTEIN> TRADD </PROTEIN> ,
<PROTEIN> RIP1 </PROTEIN> , <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
'''

import itertools as it

def scan(line):
    # groupby key: the tag name in front of the first '>'
    if not line: return
    line = line.strip()
    parts = line.split('>', 1)
    return parts[0]

def align(txt,sml):
    i = 0
    for k,g in it.groupby(sml.split('<'),scan):
        g = list(g)
        if not g[0]: continue
        text = g[0].split('>')[1]#.replace('\n','')
        if k.startswith('/'):
            i += len(text)
        else:
            offset = len(text.strip())
            yield k, i, i+offset
            i += offset

print(list(align(text,sgml)))

------------

[('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 38, 44),
('PROTEIN', 52, 57), ('PROTEIN', 131, 162), ('PROTEIN', 176, 181),
('PROTEIN', 184, 188), ('PROTEIN', 191, 196)]

It's off, possibly because of the punctuation; I can't figure it out.
Maybe you can tweak it?
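
The drift in these offsets looks consistent with counting the SGML's own whitespace: " ( " is three characters in the SGML but " (" is only two in the text, and the results are off by one from DAB2IP onward. A possible tweak, sketched here under the assumptions that elements are flat and non-empty, that there are no entities or attributes, and that the two strings differ only in whitespace, is to count non-whitespace characters instead. The name `align_by_chars` is made up for this sketch:

```python
def align_by_chars(text, sgml):
    """Yield (tag, start, end) spans into text for each SGML element."""
    # Offsets in text of every non-whitespace character, in order.
    offsets = [i for i, ch in enumerate(text) if not ch.isspace()]
    count = 0         # non-whitespace characters seen so far in sgml
    open_tag = None   # (tag name, count when the element was opened)
    pos = 0
    while pos < len(sgml):
        ch = sgml[pos]
        if ch == "<":
            close = sgml.index(">", pos)
            tag = sgml[pos + 1:close].strip()
            if tag.startswith("/"):
                name, first = open_tag
                # The element covers characters first .. count - 1.
                yield name, offsets[first], offsets[count - 1] + 1
                open_tag = None
            else:
                open_tag = (tag, count)
            pos = close + 1
            continue
        if not ch.isspace():
            count += 1
        pos += 1

text = '''TNF binding induces release of AIP1 (DAB2IP) from
TNFR1, resulting in cytoplasmic translocation and concomitant
formation of an intracellular signaling complex comprised of TRADD,
RIP1, TRAF2, and AIPl.'''

sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of
<PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from
<PROTEIN> TNFR1 </PROTEIN> , resulting in cytoplasmic translocation
and concomitant formation of an <PROTEIN> intracellular signaling
complex </PROTEIN> comprised of <PROTEIN> TRADD </PROTEIN> ,
<PROTEIN> RIP1 </PROTEIN> , <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
'''

print(list(align_by_chars(text, sgml)))
```

Because whitespace is never counted on either side, extra spaces around punctuation in the SGML cannot shift the offsets.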

hth

Gerard

Steven Bethard · Jun 19, 2006

Gerard said:

It's off, possibly because of the punctuation; I can't figure it out.

Thanks for taking a look. Yeah, the alignment's a big part of the
problem. It'd be really nice if the thing that gives me SGML didn't add
whitespace haphazardly. ;-)

STeVe

Steven Bethard · Jun 19, 2006

Steven said:
I have some plain text data and some SGML markup for that text that I
need to align. (The SGML doesn't maintain the original whitespace, so I
have to do some alignment; I can't just calculate the indices directly.) [snip]
Note that the SGML inserts spaces not only within the SGML elements, but
also around punctuation. [snip]
I need to determine the indices in the original text that each SGML
element corresponds to.

Ok, below is a working version that doesn't use regular expressions.
It's far from concise, but at least it doesn't fail like re does when I
have more than 100 words. =)

    # Assumption: the post does not show its ElementTree import; the
    # standard-library module is used here.
    import xml.etree.ElementTree as etree

    def align(text, sgml):
        # convert SGML tree to words, and assemble a list of the
        # start word index and end word index for each SGML element
        sgml = sgml.replace('&', '&amp;')
        tree = etree.fromstring('<xml>%s</xml>' % sgml)
        words = []
        if tree.text is not None:
            words.extend(tree.text.split())
        word_spans = []
        for elem in tree:
            elem_words = elem.text.split()
            start = len(words)
            end = start + len(elem_words)
            word_spans.append((start, end, elem.tag))
            words.extend(elem_words)
            if elem.tail is not None:
                words.extend(elem.tail.split())
        # determine the start character index and end character index
        # for each word from the SGML
        char_spans = []
        start = 0
        for word in words:
            while text[start:start + 1].isspace():
                start += 1
            end = start + len(word)
            assert text[start:end] == word, (text[start:end], word)
            char_spans.append((start, end))
            start = end
        # convert the word indices for each SGML element to
        # character indices
        for word_start, word_end, label in word_spans:
            start, _ = char_spans[word_start]
            _, end = char_spans[word_end - 1]
            yield label, start, end

    >>> text = '''TNF binding induces release of AIP1 (DAB2IP) from TNFR1,
    ... resulting in cytoplasmic translocation and concomitant formation of an
    ... intracellular signaling complex comprised of TRADD, RIP1, TRAF2, and
    ... AIPl.'''
    >>> sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of
    ... <PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from <PROTEIN>
    ... TNFR1 </PROTEIN> , resulting in cytoplasmic translocation and
    ... concomitant formation of an <PROTEIN> intracellular signaling complex
    ... </PROTEIN> comprised of <PROTEIN> TRADD </PROTEIN> , <PROTEIN> RIP1
    ... </PROTEIN> , <PROTEIN> TRAF2 </PROTEIN> , and AIPl .'''
    >>> list(align(text, sgml))
    [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 37, 43),
    ('PROTEIN', 50, 55), ('PROTEIN', 128, 159), ('PROTEIN', 173, 178),
    ('PROTEIN', 180, 184), ('PROTEIN', 186, 191)]
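
As a quick sanity check on spans like these, slicing the original text with each (start, end) pair should give back the marked-up protein names. This is a small illustrative sketch rather than part of the post; the text is written on one logical line here so that no span contains a line break, and the spans are copied from the output above:

```python
text = ("TNF binding induces release of AIP1 (DAB2IP) from TNFR1, "
        "resulting in cytoplasmic translocation and concomitant formation "
        "of an intracellular signaling complex comprised of TRADD, RIP1, "
        "TRAF2, and AIPl.")

spans = [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 37, 43),
         ('PROTEIN', 50, 55), ('PROTEIN', 128, 159), ('PROTEIN', 173, 178),
         ('PROTEIN', 180, 184), ('PROTEIN', 186, 191)]

# Recover the substring of the original text covered by each element.
mentions = [text[start:end] for _label, start, end in spans]
print(mentions)
# → ['TNF', 'AIP1', 'DAB2IP', 'TNFR1', 'intracellular signaling complex',
#    'TRADD', 'RIP1', 'TRAF2']
```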

STeVe

Gerard Flanagan · Jun 19, 2006

Steven said:

Thanks for taking a look. Yeah, the alignment's a big part of the
problem. It'd be really nice if the thing that gives me SGML didn't add
whitespace haphazardly. ;-)

STeVe

I see, the problem was different than I thought. When all you have is a
hammer...

Gerard
