aligning SGML to text

Discussion in 'Python' started by Steven Bethard, Jun 18, 2006.

  1. I have some plain text data and some SGML markup for that text that I
    need to align. (The SGML doesn't maintain the original whitespace, so I
    have to do some alignment; I can't just calculate the indices directly.)
    For example, some of my text looks like:

    TNF binding induces release of AIP1 (DAB2IP) from TNFR1, resulting in
    cytoplasmic translocation and concomitant formation of an intracellular
    signaling complex comprised of TRADD, RIP1, TRAF2, and AIPl.

    And the corresponding SGML looks like:

    <PROTEIN> TNF </PROTEIN> binding induces release of <PROTEIN> AIP1
    </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from <PROTEIN> TNFR1
    </PROTEIN> , resulting in cytoplasmic translocation and concomitant
    formation of an <PROTEIN> intracellular signaling complex </PROTEIN>
    comprised of <PROTEIN> TRADD </PROTEIN> , <PROTEIN> RIP1 </PROTEIN> ,
    <PROTEIN> TRAF2 </PROTEIN> , and AIPl .

    Note that the SGML inserts spaces not only within the SGML elements, but
    also around punctuation.
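
    Since the SGML only adds whitespace and tags, the two strings should agree once both are removed. A minimal sketch of that sanity check follows; the helper name same_without_whitespace is hypothetical, and it assumes tags never contain a literal '>' and that the SGML uses no character entities:

```python
import re

def same_without_whitespace(text, sgml):
    """Return True if text and sgml agree once tags and whitespace are dropped."""
    # Drop anything that looks like <TAG> or </TAG> (assumed tag syntax).
    stripped = re.sub(r'<[^>]+>', '', sgml)
    # str.split() with no arguments discards every run of whitespace.
    return ''.join(text.split()) == ''.join(stripped.split())
```

    If this returns False for a document, no whitespace-only alignment can succeed, so it is a cheap check to run first.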


    I need to determine the indices in the original text that each SGML
    element corresponds to. Here's some working code to do this, based on a
    suggestion for a related problem by Fredrik Lundh[1]::

    import re
    import elementtree.ElementTree as etree

    def align(text, sgml):
        sgml = sgml.replace('&', '&amp;')
        tree = etree.fromstring('<xml>%s</xml>' % sgml)
        words = []
        if tree.text is not None:
            words.extend(tree.text.split())
        word_indices = []
        for elem in tree:
            elem_words = elem.text.split()
            start = len(words)
            end = start + len(elem_words)
            word_indices.append((start, end, elem.tag))
            words.extend(elem_words)
            if elem.tail is not None:
                words.extend(elem.tail.split())
        # one group per word, separated by optional whitespace
        expr = r'\s*'.join('(%s)' % re.escape(word) for word in words)
        match = re.match(expr, text)
        assert match is not None
        for word_start, word_end, label in word_indices:
            start = match.start(word_start + 1)
            end = match.end(word_end)
            yield label, start, end


    >>> text = '''TNF binding induces release of AIP1 (DAB2IP) from
    ... TNFR1, resulting in cytoplasmic translocation and concomitant
    ... formation of an intracellular signaling complex comprised of TRADD,
    ... RIP1, TRAF2, and AIPl.'''
    >>> sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of
    ... <PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from
    ... <PROTEIN> TNFR1 </PROTEIN> , resulting in cytoplasmic translocation
    ... and concomitant formation of an <PROTEIN> intracellular signaling
    ... complex </PROTEIN> comprised of <PROTEIN> TRADD </PROTEIN> ,
    ... <PROTEIN> RIP1 </PROTEIN> , <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
    ... '''
    >>> list(align(text, sgml))
    [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 37, 43),
    ('PROTEIN', 50, 55), ('PROTEIN', 128, 159), ('PROTEIN', 173, 178),
    ('PROTEIN', 180, 184), ('PROTEIN', 186, 191)]

    The problem is, this doesn't work when my text is long (which it is)
    because Python's re module limits a pattern to 100 groups. I get an
    error like::

    Traceback (most recent call last):
      ...
    AssertionError: sorry, but this version only supports 100 named groups

    I also played around with difflib.SequenceMatcher for a while, but
    couldn't get a solution based on that working. Any suggestions?


    [1] http://mail.python.org/pipermail/python-list/2005-December/313388.html

    Thanks,

    STeVe
    Steven Bethard, Jun 18, 2006
    #1

  2. Steven Bethard wrote:
    > I have some plain text data and some SGML markup for that text that I
    > need to align. (The SGML doesn't maintain the original whitespace, so I
    > have to do some alignment; I can't just calculate the indices directly.)

    [...]
    >
    > The problem is, this doesn't work when my text is long (which it is)
    > because regular expressions are limited to 100 groups. I get an error
    > like::

    [...]

    Steve

    This is probably an abuse of itertools...

    ---8<---
    text = '''TNF binding induces release of AIP1 (DAB2IP) from
    TNFR1, resulting in cytoplasmic translocation and concomitant
    formation of an intracellular signaling complex comprised of TRADD,
    RIP1, TRAF2, and AIPl.'''

    sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of
    <PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from
    <PROTEIN> TNFR1 </PROTEIN> , resulting in cytoplasmic translocation
    and concomitant formation of an <PROTEIN> intracellular signaling
    complex </PROTEIN> comprised of <PROTEIN> TRADD </PROTEIN> ,
    <PROTEIN> RIP1 </PROTEIN> , <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
    '''

    import itertools as it
    import string

    def scan(line):
        if not line: return
        line = line.strip()
        parts = string.split(line, '>', maxsplit=1)
        return parts[0]

    def align(txt,sml):
        i = 0
        for k,g in it.groupby(sml.split('<'),scan):
            g = list(g)
            if not g[0]: continue
            text = g[0].split('>')[1]#.replace('\n','')
            if k.startswith('/'):
                i += len(text)
            else:
                offset = len(text.strip())
                yield k, i, i+offset
                i += offset

    print list(align(text,sgml))

    ------------

    [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 38, 44),
    ('PROTEIN', 52, 57), ('PROTEIN', 131, 162), ('PROTEIN', 176, 181),
    ('PROTEIN', 184, 188), ('PROTEIN', 191, 196)]

    It's off, possibly because of the punctuation; I can't figure it out.
    Maybe you can tweak it?
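
    The drift in the offsets above is consistent with the extra spaces the SGML puts around punctuation: each inserted space pushes every later position one character further along. A small sketch that measures this on a fragment of the example, with the tags already removed by hand (the helper name drift is hypothetical):

```python
def drift(text, spaced, word):
    """How far `word` has moved in the space-padded string relative to the text."""
    return spaced.index(word) - text.index(word)

text = 'release of AIP1 (DAB2IP) from TNFR1, resulting'
spaced = 'release of AIP1 ( DAB2IP ) from TNFR1 , resulting'

# One extra space after "(", one before ")", one before ",".
drifts = [drift(text, spaced, w) for w in ('AIP1', 'DAB2IP', 'TNFR1', 'resulting')]
print(drifts)  # [0, 1, 2, 3]
```

    The same 1, 2, 3 pattern shows up when the offsets above are compared with the expected ones (38 vs 37, 52 vs 50, 131 vs 128), so lengths measured on the SGML side cannot be used directly as indices into the text.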

    hth

    Gerard
    Gerard Flanagan, Jun 18, 2006
    #2

  3. Gerard Flanagan wrote:
    > Steven Bethard wrote:
    >> I have some plain text data and some SGML markup for that text that I
    >> need to align. (The SGML doesn't maintain the original whitespace, so I
    >> have to do some alignment; I can't just calculate the indices directly.)

    > [...]
    >
    > This is probably an abuse of itertools...
    >
    > [...]
    >
    > It's off because of the punctuation possibly, can't figure it out.


    Thanks for taking a look. Yeah, the alignment's a big part of the
    problem. It'd be really nice if the thing that gives me SGML didn't add
    whitespace haphazardly. ;-)

    STeVe
    Steven Bethard, Jun 19, 2006
    #3
  4. Steven Bethard wrote:
    > I have some plain text data and some SGML markup for that text that I
    > need to align. (The SGML doesn't maintain the original whitespace, so I
    > have to do some alignment; I can't just calculate the indices directly.)

    [snip]
    > Note that the SGML inserts spaces not only within the SGML elements, but
    > also around punctuation.

    [snip]
    > I need to determine the indices in the original text that each SGML
    > element corresponds to.


    Ok, below is a working version that doesn't use regular expressions.
    It's far from concise, but at least it doesn't fail like re does when I
    have more than 100 words. =)

    >>> import elementtree.ElementTree as etree
    >>> def align(text, sgml):
    ...     # convert SGML tree to words, and assemble a list of the
    ...     # start word index and end word index for each SGML element
    ...     sgml = sgml.replace('&', '&amp;')
    ...     tree = etree.fromstring('<xml>%s</xml>' % sgml)
    ...     words = []
    ...     if tree.text is not None:
    ...         words.extend(tree.text.split())
    ...     word_spans = []
    ...     for elem in tree:
    ...         elem_words = elem.text.split()
    ...         start = len(words)
    ...         end = start + len(elem_words)
    ...         word_spans.append((start, end, elem.tag))
    ...         words.extend(elem_words)
    ...         if elem.tail is not None:
    ...             words.extend(elem.tail.split())
    ...     # determine the start character index and end character index
    ...     # for each word from the SGML
    ...     char_spans = []
    ...     start = 0
    ...     for word in words:
    ...         while text[start:start + 1].isspace():
    ...             start += 1
    ...         end = start + len(word)
    ...         assert text[start:end] == word, (text[start:end], word)
    ...         char_spans.append((start, end))
    ...         start = end
    ...     # convert the word indices for each SGML element to
    ...     # character indices
    ...     for word_start, word_end, label in word_spans:
    ...         start, _ = char_spans[word_start]
    ...         _, end = char_spans[word_end - 1]
    ...         yield label, start, end
    ...
    >>> text = '''TNF binding induces release of AIP1 (DAB2IP) from TNFR1,
    ... resulting in cytoplasmic translocation and concomitant formation of an
    ... intracellular signaling complex comprised of TRADD, RIP1, TRAF2, and
    ... AIPl.'''
    >>> sgml = '''<PROTEIN> TNF </PROTEIN> binding induces release of
    ... <PROTEIN> AIP1 </PROTEIN> ( <PROTEIN> DAB2IP </PROTEIN> ) from <PROTEIN>
    ... TNFR1 </PROTEIN> , resulting in cytoplasmic translocation and
    ... concomitant formation of an <PROTEIN> intracellular signaling complex
    ... </PROTEIN> comprised of <PROTEIN> TRADD </PROTEIN> , <PROTEIN> RIP1
    ... </PROTEIN> , <PROTEIN> TRAF2 </PROTEIN> , and AIPl .
    ... '''
    >>> list(align(text, sgml))
    [('PROTEIN', 0, 3), ('PROTEIN', 31, 35), ('PROTEIN', 37, 43),
    ('PROTEIN', 50, 55), ('PROTEIN', 128, 159), ('PROTEIN', 173, 178),
    ('PROTEIN', 180, 184), ('PROTEIN', 186, 191)]
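
    For comparison, here is a rough sketch of another regex-group-free variant that was not posted in this thread: it counts non-whitespace characters rather than words. The function name align_by_nonspace is hypothetical, and the sketch assumes that the text and the tag-stripped SGML contain exactly the same non-whitespace characters in the same order, that tags never contain '>', and that no entities appear:

```python
import re

def align_by_nonspace(text, sgml):
    """Yield (label, start, end) spans in text for each SGML element."""
    # offsets[i] is the index in text of the i-th non-whitespace character.
    offsets = [pos for pos, ch in enumerate(text) if not ch.isspace()]
    count = 0        # non-whitespace characters of the SGML seen so far
    open_tags = []   # stack of (label, count when the tag was opened)
    # The capturing group makes re.split keep the tags as separate tokens.
    for token in re.split(r'(<[^>]+>)', sgml):
        if token.startswith('</'):
            label, first = open_tags.pop()
            if count > first:  # skip elements with no content
                yield label, offsets[first], offsets[count - 1] + 1
        elif token.startswith('<'):
            open_tags.append((token[1:-1].strip(), count))
        else:
            count += sum(1 for ch in token if not ch.isspace())
```

    Because it keeps a stack of open tags, this version would also cope with nested elements, which the word-based code above does not attempt; it has only been checked against short fragments like the example here.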

    STeVe
    Steven Bethard, Jun 19, 2006
    #4
  5. Steven Bethard wrote:
    > Gerard Flanagan wrote:
    > > Steven Bethard wrote:
    > >> I have some plain text data and some SGML markup for that text that I
    > >> need to align. (The SGML doesn't maintain the original whitespace, so I
    > >> have to do some alignment; I can't just calculate the indices directly.)
    > >> For example, some of my text looks like:

    [...]
    > >
    > > Steve
    > >
    > > This is probably an abuse of itertools...
    > >

    [snip hammering]
    >
    > Thanks for taking a look. Yeah, the alignment's a big part of the
    > problem. It'd be really nice if the thing that gives me SGML didn't add
    > whitespace haphazardly. ;-)
    >
    > STeVe


    I see, the problem was different than I thought. When all you have is a
    hammer... :)

    Gerard
    Gerard Flanagan, Jun 19, 2006
    #5
