aligning ElementTrees to text

Discussion in 'Python' started by Steven Bethard, Jan 17, 2007.

  1. I'm trying to align an XML file with the original text file from which
    it was created. Unfortunately, the XML version of the file has added and
    removed some of the whitespace. For example::

    >>> plain_text = '''
    ... Pacific First Financial Corp. said shareholders approved its
    ... acquisition.
    ... '''
    >>> xml_text = ''' <s>Pacific First Financial Corp.
    ... <EVENT eid="e1" class="REPORTING" > said </EVENT> shareholders
    ... <EVENT eid="e2" class="OCCURRENCE" >approved</EVENT> its
    ... <EVENT eid="e8" class="OCCURRENCE" > acquis ition </EVENT>.
    ... </s>
    ... '''

    I want to determine which offsets in the *original* text each element
    from the XML text is supposed to cover. So I want something like::

    >>> xml_tree = etree.fromstring(xml_text)
    >>> align(xml_tree, plain_text)
    [(<Element 'EVENT' at 01411B00>, 31, 35),
     (<Element 'EVENT' at 01411EA8>, 49, 57),
     (<Element 'EVENT' at 01411E18>, 62, 73),
     (<Element 's' at 01411FC8>, 1, 74)]

    where ``align`` has returned a list of all elements in the XML text
    along with their start and end indices in the original text::

    >>> plain_text[31:35]
    'said'
    >>> plain_text[49:57]
    'approved'
    >>> plain_text[62:73]
    'acquisition'

    Note that I want to ignore whitespace as much as possible, so the
    elements are aligned only to the non-whitespace text they include.
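
    To make the whitespace assumption concrete, here is a minimal sketch
    of the property the alignment relies on: once all whitespace is dropped,
    the text content of the XML and the plain text should be identical,
    character for character. The helper name ``squeeze`` is made up for this
    illustration, and I'm assuming ``etree`` is ``xml.etree.ElementTree``.

```python
import xml.etree.ElementTree as etree

plain_text = '''
Pacific First Financial Corp. said shareholders approved its
acquisition.
'''
xml_text = ''' <s>Pacific First Financial Corp.
<EVENT eid="e1" class="REPORTING" > said </EVENT> shareholders
<EVENT eid="e2" class="OCCURRENCE" >approved</EVENT> its
<EVENT eid="e8" class="OCCURRENCE" > acquis ition </EVENT>.
</s>
'''


def squeeze(s):
    """Return s with every whitespace character removed (hypothetical helper)."""
    return ''.join(s.split())


tree = etree.fromstring(xml_text)

# itertext() yields each element's text and tail in document order,
# so joining it gives the XML's text content without the markup.
xml_chars = squeeze(''.join(tree.itertext()))
plain_chars = squeeze(plain_text)

# The two only differ in whitespace, so the squeezed forms compare equal.
same = xml_chars == plain_chars
```

    If ``same`` were ever false, no whitespace-insensitive alignment could
    succeed, so this is a cheap check to run before aligning.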

    Below is my current implementation of the ``align`` function. It seems
    pretty messy to me -- can anyone offer me some advice on how to clean it
    up or write it differently?

    def align(tree, text):

        def align_helper(elem, elem_start):
            # skip whitespace in the text before the element
            while text[elem_start:elem_start + 1].isspace():
                elem_start += 1

            # advance the element end past any element text
            elem_end = elem_start
            if elem.text is not None:
                for char in elem.text:
                    if not char.isspace():
                        while text[elem_end:elem_end + 1].isspace():
                            elem_end += 1
                        assert text[elem_end] == char
                        elem_end += 1

            # advance the element end past any child elements
            for child_elem in elem:
                elem_end = align_helper(child_elem, elem_end)

            # advance the start for the next element past the tail text
            next_start = elem_end
            if elem.tail is not None:
                for char in elem.tail:
                    if not char.isspace():
                        while text[next_start:next_start + 1].isspace():
                            next_start += 1
                        assert text[next_start] == char
                        next_start += 1

            # add the element and its start and end to the result list
            result.append((elem, elem_start, elem_end))

            # return the start of the next element
            return next_start

        result = []
        align_helper(tree, 0)
        return result
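
    One direction I've considered for cleaning this up, offered only as a
    rough sketch and not a tested replacement: precompute the offset of
    every non-whitespace character in the text once, then walk the tree
    counting non-whitespace characters instead of scanning the text twice
    (once for ``elem.text`` and once for ``elem.tail``). The names
    ``align_by_offsets``, ``count`` and ``visit`` are made up here, and
    unlike the version above it checks the characters in one pass up front
    and not one at a time.

```python
import xml.etree.ElementTree as etree


def align_by_offsets(tree, text):
    """Sketch: align elements to text via a table of non-whitespace offsets."""
    # offsets[i] is the index in `text` of its i-th non-whitespace character
    offsets = [i for i, c in enumerate(text) if not c.isspace()]

    # up-front check that the XML and the text agree apart from whitespace
    xml_chars = ''.join(''.join(tree.itertext()).split())
    if xml_chars != ''.join(text[i] for i in offsets):
        raise ValueError('XML text and plain text differ beyond whitespace')

    result = []

    def count(s):
        # number of non-whitespace characters in s (s may be None)
        return len(''.join(s.split())) if s else 0

    def visit(elem, pos):
        # pos counts the non-whitespace characters consumed so far
        start = pos
        pos += count(elem.text)
        for child in elem:
            pos = visit(child, pos)
        # an element with no text collapses to an empty span at the
        # next non-whitespace offset (or the end of the text)
        begin = offsets[start] if start < len(offsets) else len(text)
        end = offsets[pos - 1] + 1 if pos > start else begin
        result.append((elem, begin, end))
        # the tail belongs to the parent, so it only moves the position
        return pos + count(elem.tail)

    visit(tree, 0)
    return result


plain_text = '''
Pacific First Financial Corp. said shareholders approved its
acquisition.
'''
xml_text = ''' <s>Pacific First Financial Corp.
<EVENT eid="e1" class="REPORTING" > said </EVENT> shareholders
<EVENT eid="e2" class="OCCURRENCE" >approved</EVENT> its
<EVENT eid="e8" class="OCCURRENCE" > acquis ition </EVENT>.
</s>
'''

spans = [(elem.tag, start, end)
         for elem, start, end in align_by_offsets(etree.fromstring(xml_text),
                                                  plain_text)]
```

    The trade-off is an extra list the size of the text in exchange for
    dropping the duplicated text/tail scanning loops.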


    Steven Bethard, Jan 17, 2007