aligning text with space-normalized text

Discussion in 'Python' started by Steven Bethard, Jun 30, 2005.

  1. I have a string with a bunch of whitespace in it, and a series of chunks
    of that string whose indices I need to find. However, the chunks have
    been whitespace-normalized, so that multiple spaces and newlines have
    been converted to single spaces as if by ' '.join(chunk.split()). Some
    example data to clarify my problem:

    py> text = """\
    ....    aaa  bb ccc
    .... dd eee.  fff  gggg
    .... hh  i.
    ....    jjj kk.
    .... """
    py> chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']

    Note that the original "text" has a variety of whitespace between words,
    but the corresponding "chunks" have only single space characters between
    "words". I'm looking for the indices of each chunk, so for this
    example, I'd like:

    py> result = [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]

    Note that the indices correspond to the *original* text so that the
    substrings in the given spans include the irregular whitespace:

    py> for s, e in result:
    ....     print repr(text[s:e])
    ....
    'aaa  bb'
    'ccc\ndd eee.'
    'fff  gggg\nhh  i.'
    'jjj'
    'kk.'

    I'm trying to write code to produce the indices. Here's what I have:

    py> def get_indices(text, chunks):
    ....     chunks = iter(chunks)
    ....     chunk = None
    ....     for text_index, c in enumerate(text):
    ....         if c.isspace():
    ....             continue
    ....         if chunk is None:
    ....             chunk = chunks.next().replace(' ', '')
    ....             chunk_start = text_index
    ....             chunk_index = 0
    ....         if c != chunk[chunk_index]:
    ....             raise Exception('unmatched: %r %r' %
    ....                             (c, chunk[chunk_index]))
    ....         else:
    ....             chunk_index += 1
    ....             if chunk_index == len(chunk):
    ....                 yield chunk_start, text_index + 1
    ....                 chunk = None
    ....

    And it appears to work:

    py> list(get_indices(text, chunks))
    [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
    py> list(get_indices(text, chunks)) == result
    True

    But it seems somewhat inelegant. Can anyone see an easier/cleaner/more
    Pythonic way[1] of writing this code?

    Thanks in advance,

    STeVe

    [1] Yes, I'm aware that these are subjective terms. I'm looking for
    subjectively "better" solutions. ;)
    Steven Bethard, Jun 30, 2005
    #1

  2. John Machin (Guest)

    Steven Bethard wrote:
    [snip]
    > And it appears to work:

    [snip]
    > But it seems somewhat inelegant. Can anyone see an easier/cleaner/more
    > Pythonic way[1] of writing this code?
    >
    > Thanks in advance,
    >
    > STeVe
    >
    > [1] Yes, I'm aware that these are subjective terms. I'm looking for
    > subjectively "better" solutions. ;)


    Perhaps you should define "work" before you worry about """subjectively
    "better" solutions""".

    If "work" is meant to detect *all* possibilities of 'chunks' not having
    been derived from 'text' in the described manner, then it doesn't work
    -- all information about the positions of the whitespace is thrown away
    by your code.

    For example, text = 'foo bar', chunks = ['foobar']
    John Machin, Jun 30, 2005
    #2

  3. John Machin wrote:
    > If "work" is meant to detect *all* possibilities of 'chunks' not having
    > been derived from 'text' in the described manner, then it doesn't work
    > -- all information about the positions of the whitespace is thrown away
    > by your code.
    >
    > For example, text = 'foo bar', chunks = ['foobar']


    This doesn't match the (admittedly vague) spec which said that chunks
    are created "as if by ' '.join(chunk.split())". For the text:
    'foo bar'
    the possible chunk lists should be something like:
    ['foo bar']
    ['foo', 'bar']
    If it helps, you can think of chunks as lists of words, where the words
    have been ' '.join()ed.
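    Concretely, that normalization is just ' '.join(chunk.split()), e.g.:

    ```python
    # A raw chunk with irregular whitespace, normalized to single spaces:
    chunk = 'fff  gggg\nhh  i.'
    print(' '.join(chunk.split()))  # fff gggg hh i.
    ```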

    STeVe
    Steven Bethard, Jun 30, 2005
    #3
  4. Peter Otten (Guest)

    Steven Bethard wrote:

    > I have a string with a bunch of whitespace in it, and a series of chunks
    > of that string whose indices I need to find.  However, the chunks have
    > been whitespace-normalized, so that multiple spaces and newlines have
    > been converted to single spaces as if by ' '.join(chunk.split()).  Some


    If you are willing to get your hands dirty with regexps:

    import re
    _reLump = re.compile(r"\S+")

    def indices(text, chunks):
        lumps = _reLump.finditer(text)
        for chunk in chunks:
            lump = [lumps.next() for _ in chunk.split()]
            yield lump[0].start(), lump[-1].end()


    def main():
        text = """\
       aaa  bb ccc
    dd eee.  fff  gggg
    hh  i.
       jjj kk.
    """
        chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
        assert list(indices(text, chunks)) == [(3, 10), (11, 22), (24, 40),
                                               (44, 47), (48, 51)]

    if __name__ == "__main__":
        main()

    Not tested beyond what you see.

    Peter
    Peter Otten, Jun 30, 2005
    #4
  5. John Machin (Guest)

    Steven Bethard wrote:
    > John Machin wrote:
    >
    >> If "work" is meant to detect *all* possibilities of 'chunks' not
    >> having been derived from 'text' in the described manner, then it
    >> doesn't work -- all information about the positions of the whitespace
    >> is thrown away by your code.
    >>
    >> For example, text = 'foo bar', chunks = ['foobar']

    >
    >
    > This doesn't match the (admittedly vague) spec


    That is *exactly* my point -- it is not valid input, and you are not
    reporting all cases of invalid input; you have an exception where the
    non-spaces are impossible, but no exception where whitespaces are
    impossible.


    > which said that chunks
    > are created "as if by ' '.join(chunk.split())". For the text:
    > 'foo bar'
    > the possible chunk lists should be something like:
    > ['foo bar']
    > ['foo', 'bar']
    > If it helps, you can think of chunks as lists of words, where the words
    > have been ' '.join()ed.


    If it helps, you can re-read my message.

    >
    > STeVe
    John Machin, Jun 30, 2005
    #5
  6. John Machin wrote:
    > Steven Bethard wrote:
    >
    >> John Machin wrote:
    >>
    >>> For example, text = 'foo bar', chunks = ['foobar']

    >>
    >> This doesn't match the (admittedly vague) spec

    >
    > That is *exactly* my point -- it is not valid input, and you are not
    > reporting all cases of invalid input; you have an exception where the
    > non-spaces are impossible, but no exception where whitespaces are
    > impossible.


    Well, the input should never look like the above. But if for some
    reason it did, I wouldn't want the error; I'd want the indices. So:
    text = 'foo bar'
    chunks = ['foobar']
    should produce:
    [(0, 7)]
    not an exception.
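    A quick check bears that out; here is the same generator in Python 3
    syntax (next(chunks) in place of chunks.next(), otherwise unchanged):

    ```python
    def get_indices(text, chunks):
        # Walk the text character by character, skipping whitespace, and
        # match the non-space characters against each space-stripped chunk;
        # yield (start, end) index pairs into the original text.
        chunks = iter(chunks)
        chunk = None
        for text_index, c in enumerate(text):
            if c.isspace():
                continue
            if chunk is None:
                chunk = next(chunks).replace(' ', '')
                chunk_start = text_index
                chunk_index = 0
            if c != chunk[chunk_index]:
                raise Exception('unmatched: %r %r' % (c, chunk[chunk_index]))
            else:
                chunk_index += 1
                if chunk_index == len(chunk):
                    yield chunk_start, text_index + 1
                    chunk = None

    print(list(get_indices('foo bar', ['foobar'])))  # [(0, 7)]
    ```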

    STeVe
    Steven Bethard, Jul 1, 2005
    #6
  7. Peter Otten wrote:
    > import re
    > _reLump = re.compile(r"\S+")
    >
    > def indices(text, chunks):
    > lumps = _reLump.finditer(text)
    > for chunk in chunks:
    > lump = [lumps.next() for _ in chunk.split()]
    > yield lump[0].start(), lump[-1].end()


    Thanks, that's a really nice, clean solution!

    STeVe
    Steven Bethard, Jul 1, 2005
    #7
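    Peter's regex approach also ports directly to Python 3, where
    lumps.next() becomes next(lumps). A self-contained sketch (the irregular
    whitespace in the sample text below is a reconstruction chosen to match
    the indices given in the thread, not the thread's literal data):

    ```python
    import re

    _lump = re.compile(r"\S+")  # one "lump" = a maximal run of non-whitespace

    def indices(text, chunks):
        lumps = _lump.finditer(text)
        for chunk in chunks:
            # Consume one lump per word in the chunk; the chunk spans from
            # the first lump's start to the last lump's end.
            lump = [next(lumps) for _ in chunk.split()]
            yield lump[0].start(), lump[-1].end()

    text = '   aaa  bb ccc\ndd eee.  fff  gggg\nhh  i.\n   jjj kk.\n'
    chunks = ['aaa bb', 'ccc dd eee.', 'fff gggg hh i.', 'jjj', 'kk.']
    print(list(indices(text, chunks)))
    # [(3, 10), (11, 22), (24, 40), (44, 47), (48, 51)]
    ```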
