aligning ElementTrees to text


Steven Bethard

I'm trying to align an XML file with the original text file from which
it was created. Unfortunately, the XML version of the file has added and
removed some of the whitespace. For example::

    >>> plain_text = '''
    ... Pacific First Financial Corp. said shareholders approved its
    ... acquisition.
    ... '''
    >>> xml_text = '''
    ... <s> Pacific First Financial Corp.
    ... <EVENT eid="e1" class="REPORTING" > said </EVENT> shareholders
    ... <EVENT eid="e2" class="OCCURRENCE" >approved</EVENT> its
    ... <EVENT eid="e8" class="OCCURRENCE" > acquis ition </EVENT>.
    ... </s>
    ... '''
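
To show where the whitespace differs, here is a small sketch that parses the XML and lists each element's ``text`` and ``tail``. It assumes the standard-library ``xml.etree.ElementTree`` module. The ``pieces`` name is only for illustration, and the opening ``<s>`` line is my assumption of how the XML begins.

```python
import xml.etree.ElementTree as etree

xml_text = '''
<s> Pacific First Financial Corp.
<EVENT eid="e1" class="REPORTING" > said </EVENT> shareholders
<EVENT eid="e2" class="OCCURRENCE" >approved</EVENT> its
<EVENT eid="e8" class="OCCURRENCE" > acquis ition </EVENT>.
</s>
'''  # assumption: the <s> element wraps the whole sentence

tree = etree.fromstring(xml_text.strip())

# Collect (tag, text, tail) for every element in document order.
pieces = [(elem.tag, elem.text, elem.tail) for elem in tree.iter()]
for tag, text, tail in pieces:
    print(tag, repr(text), repr(tail))
```

The extra spaces around ``said`` and inside ``acquis ition`` show up in ``text``, while the words between elements end up in ``tail``.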

I want to determine which offsets in the *original* text each element
from the XML text is supposed to cover. So I want something like::

    [(<Element 'EVENT' at 01411B00>, 31, 35),
     (<Element 'EVENT' at 01411EA8>, 49, 57),
     (<Element 'EVENT' at 01411E18>, 62, 73),
     (<Element 's' at 01411FC8>, 1, 74)]

where ``align`` has returned a list of all elements in the XML text
along with their start and end indices in the original text::

    >>> plain_text[31:35]
    'said'
    >>> plain_text[49:57]
    'approved'
    >>> plain_text[62:73]
    'acquisition'

Note that I want to ignore whitespace as much as possible, so the
elements are aligned only to the non-whitespace text they include.
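
To illustrate what I mean by ignoring whitespace, here is a rough sketch of the idea rather than a solution. If both texts are reduced to their non-whitespace characters, positions in the reduced text can be mapped back to offsets in the original. The helper name ``non_space_offsets`` is made up for this example.

```python
def non_space_offsets(text):
    """Return the index of every non-whitespace character in ``text``."""
    return [i for i, char in enumerate(text) if not char.isspace()]

plain_text = '''
Pacific First Financial Corp. said shareholders approved its
acquisition.
'''
offsets = non_space_offsets(plain_text)

# 'acquis ition' (XML) and 'acquisition' (original) are identical once
# whitespace is dropped, so look the word up in the stripped text ...
stripped = ''.join(plain_text.split())
k = stripped.index('acquisition')

# ... and map its first and last characters back to original offsets.
start = offsets[k]
end = offsets[k + len('acquisition') - 1] + 1
print(start, end, plain_text[start:end])
```

This only handles a single known word. The hard part is doing the same thing for every element while walking the tree.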


Below is my current implementation of the ``align`` function. It seems
pretty messy to me -- can anyone offer me some advice on how to clean it
up or write it differently?

    def align(tree, text):

        def align_helper(elem, elem_start):
            # skip whitespace in the text before the element
            while text[elem_start:elem_start + 1].isspace():
                elem_start += 1

            # advance the element end past any element text
            elem_end = elem_start
            if elem.text is not None:
                for char in elem.text:
                    if not char.isspace():
                        while text[elem_end:elem_end + 1].isspace():
                            elem_end += 1
                        assert text[elem_end] == char
                        elem_end += 1

            # advance the element end past any child elements
            for child_elem in elem:
                elem_end = align_helper(child_elem, elem_end)

            # advance the start for the next element past the tail text
            next_start = elem_end
            if elem.tail is not None:
                for char in elem.tail:
                    if not char.isspace():
                        while text[next_start:next_start + 1].isspace():
                            next_start += 1
                        assert text[next_start] == char
                        next_start += 1

            # add the element and its start and end to the result list
            result.append((elem, elem_start, elem_end))

            # return the start of the next element
            return next_start

        result = []
        align_helper(tree, 0)
        return result
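
One direction I have been considering, as a rough sketch only, is to pull the duplicated character-matching loop into a helper so that ``text`` and ``tail`` are handled by the same code. The name ``skip_text`` is made up, the example assumes the standard-library ``xml.etree.ElementTree`` module, and the opening ``<s>`` line of the XML is my assumption.

```python
import xml.etree.ElementTree as etree

def skip_text(text, pos, elem_text):
    """Advance pos in text past the non-whitespace chars of elem_text."""
    for char in elem_text or '':  # treat a missing text/tail as empty
        if char.isspace():
            continue
        while text[pos:pos + 1].isspace():
            pos += 1
        assert text[pos] == char
        pos += 1
    return pos

def align(tree, text):
    result = []

    def align_helper(elem, start):
        # skip whitespace in the text before the element
        while text[start:start + 1].isspace():
            start += 1
        # element text first, then the children in document order
        end = skip_text(text, start, elem.text)
        for child in elem:
            end = align_helper(child, end)
        result.append((elem, start, end))
        # the tail is outside the element, so it only moves the next start
        return skip_text(text, end, elem.tail)

    align_helper(tree, 0)
    return result

plain_text = '''
Pacific First Financial Corp. said shareholders approved its
acquisition.
'''
xml_text = '''
<s> Pacific First Financial Corp.
<EVENT eid="e1" class="REPORTING" > said </EVENT> shareholders
<EVENT eid="e2" class="OCCURRENCE" >approved</EVENT> its
<EVENT eid="e8" class="OCCURRENCE" > acquis ition </EVENT>.
</s>
'''  # assumption: the <s> element wraps the whole sentence

tree = etree.fromstring(xml_text.strip())
spans = [(elem.get('eid'), start, end)
         for elem, start, end in align(tree, plain_text)]
print(spans)
```

This keeps the same recursion and the same result order as the version above. It is shorter, but I am not sure it is any clearer.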


Thanks,

STeVe
 
