Best practice for operations on streams of text

Discussion in 'Python' started by James, May 7, 2009.

  1. James

    James Guest

    Hello all,
    I'm working on some NLP code - what I'm doing is passing a large
    number of tokens through a number of filtering / processing steps.

    The filters take a token as input and may or may not yield a token
    as a result. For example, I might chain together filters that
    lowercase the input, filter out boring words, and filter out
    duplicates.

    I originally had code like this:
    for t0 in token_stream:
        for t1 in lowercase_token(t0):
            for t2 in remove_boring(t1):
                for t3 in remove_dupes(t2):
                    yield t3

    Apart from being ugly as sin, I only get one token out as
    StopIteration is raised before the whole token stream is consumed.

    Any suggestions on an elegant way to chain together a bunch of
    generators, with processing steps in between?

    Thanks,
    James
    James, May 7, 2009
    #1

  2. James <> writes:

    > Hello all,
    > I'm working on some NLP code - what I'm doing is passing a large
    > number of tokens through a number of filtering / processing steps.
    >
    > The filters take a token as input and may or may not yield a token
    > as a result. For example, I might chain together filters that
    > lowercase the input, filter out boring words, and filter out
    > duplicates.
    >
    > I originally had code like this:
    > for t0 in token_stream:
    >     for t1 in lowercase_token(t0):
    >         for t2 in remove_boring(t1):
    >             for t3 in remove_dupes(t2):
    >                 yield t3
    >
    > Apart from being ugly as sin, I only get one token out as
    > StopIteration is raised before the whole token stream is consumed.
    >
    > Any suggestions on an elegant way to chain together a bunch of
    > generators, with processing steps in between?
    >
    > Thanks,
    > James


    Coroutines, my friends. Google will help you greatly in discovering
    this processing wonder.
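
    Here is a minimal sketch of the idea (illustrative, not from this
    thread): each stage is a generator, primed with next() so it can
    receive tokens via send() and push results on to the next stage.
    The names (coroutine, lowercase, printer) are hypothetical.

    def coroutine(func):
        # Decorator: advance a new generator to its first yield so it
        # is ready to receive values via send().
        def start(*args, **kwargs):
            gen = func(*args, **kwargs)
            next(gen)
            return gen
        return start

    @coroutine
    def lowercase(target):
        while True:
            token = (yield)
            target.send(token.lower())

    @coroutine
    def printer():
        while True:
            token = (yield)
            print(token)

    pipeline = lowercase(printer())
    for token in ["Hello", "WORLD"]:
        pipeline.send(token)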
    J Kenneth King, May 7, 2009
    #2

  3. James

    Gary Herron Guest

    James wrote:
    > Hello all,
    > I'm working on some NLP code - what I'm doing is passing a large
    > number of tokens through a number of filtering / processing steps.
    >
    > The filters take a token as input and may or may not yield a token
    > as a result. For example, I might chain together filters that
    > lowercase the input, filter out boring words, and filter out
    > duplicates.
    >
    > I originally had code like this:
    > for t0 in token_stream:
    >     for t1 in lowercase_token(t0):
    >         for t2 in remove_boring(t1):
    >             for t3 in remove_dupes(t2):
    >                 yield t3
    >
    > Apart from being ugly as sin, I only get one token out as
    > StopIteration is raised before the whole token stream is consumed.
    >
    > Any suggestions on an elegant way to chain together a bunch of
    > generators, with processing steps in between?
    >
    > Thanks,
    > James


    David Beazley has a very interesting talk on using generators for
    building and linking together individual stream filters. It's very
    cool and surprisingly eye-opening.

    See "Generator Tricks for Systems Programmers" at
    http://www.dabeaz.com/generators/

    Gary Herron
    Gary Herron, May 7, 2009
    #3
  4. James

    MRAB Guest

    James wrote:
    > Hello all,
    > I'm working on some NLP code - what I'm doing is passing a large
    > number of tokens through a number of filtering / processing steps.
    >
    > The filters take a token as input and may or may not yield a token
    > as a result. For example, I might chain together filters that
    > lowercase the input, filter out boring words, and filter out
    > duplicates.
    >
    > I originally had code like this:
    > for t0 in token_stream:
    >     for t1 in lowercase_token(t0):
    >         for t2 in remove_boring(t1):
    >             for t3 in remove_dupes(t2):
    >                 yield t3
    >
    > Apart from being ugly as sin, I only get one token out as
    > StopIteration is raised before the whole token stream is consumed.
    >
    > Any suggestions on an elegant way to chain together a bunch of
    > generators, with processing steps in between?
    >

    What you should be doing is letting the filters accept an iterator and
    yield values on demand:

    def lowercase_token(stream):
        for t in stream:
            yield t.lower()

    def remove_boring(stream):
        # assumes a set 'boring' of unwanted words is defined elsewhere
        for t in stream:
            if t not in boring:
                yield t

    def remove_dupes(stream):
        seen = set()
        for t in stream:
            if t not in seen:
                yield t
                seen.add(t)

    def compound_filter(token_stream):
        stream = lowercase_token(token_stream)
        stream = remove_boring(stream)
        stream = remove_dupes(stream)
        for t in stream:
            yield t
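
    For illustration (not part of the original post), usage might look
    like this, assuming a stop-word set 'boring' is defined:

    boring = set(['the', 'a'])
    tokens = ['The', 'the', 'Cat', 'sat', 'on', 'the', 'mat']
    for t in compound_filter(tokens):
        print(t)   # cat, sat, on, mat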
    MRAB, May 7, 2009
    #4
  5. James

    Terry Reedy Guest

    MRAB wrote:
    > James wrote:
    >> Hello all,
    >> I'm working on some NLP code - what I'm doing is passing a large
    >> number of tokens through a number of filtering / processing steps.
    >>
    >> The filters take a token as input and may or may not yield a token
    >> as a result. For example, I might chain together filters that
    >> lowercase the input, filter out boring words, and filter out
    >> duplicates.
    >>
    >> I originally had code like this:
    >> for t0 in token_stream:
    >>     for t1 in lowercase_token(t0):
    >>         for t2 in remove_boring(t1):
    >>             for t3 in remove_dupes(t2):
    >>                 yield t3


    For that to work at all, the three functions would have to turn each
    token into an iterable of 0 or 1 tokens, so the inner 'loops' would
    execute 0 or 1 times. Better to return a token or None, and either
    replace the three inner 'loops' with three conditional statements
    (ugly too) or, less efficiently (due to the lack of
    short-circuiting), write:

    t = remove_dupes(remove_boring(lowercase_token(t0)))
    if t is not None: yield t
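
    Spelled out, the conditional version might look like this (an
    illustrative sketch, assuming each filter takes a single token and
    returns either a token or None):

    def process(token_stream):
        for t0 in token_stream:
            t = lowercase_token(t0)
            if t is None:
                continue
            t = remove_boring(t)
            if t is None:
                continue
            t = remove_dupes(t)
            if t is not None:
                yield t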

    >> Apart from being ugly as sin, I only get one token out as
    >> StopIteration is raised before the whole token stream is consumed.


    That puzzles me. Your actual code must be slightly different from the
    above and from what I imagine the functions to be. But never mind,
    because

    >> Any suggestions on an elegant way to chain together a bunch of
    >> generators, with processing steps in between?


    MRAB's suggestion is the way to go. You automatically get
    short-circuiting because each generator only gets what is passed on.
    And resuming a generator is much faster than re-calling a function.

    > What you should be doing is letting the filters accept an iterator and
    > yield values on demand:
    >
    > def lowercase_token(stream):
    >     for t in stream:
    >         yield t.lower()
    >
    > def remove_boring(stream):
    >     for t in stream:
    >         if t not in boring:
    >             yield t
    >
    > def remove_dupes(stream):
    >     seen = set()
    >     for t in stream:
    >         if t not in seen:
    >             yield t
    >             seen.add(t)
    >
    > def compound_filter(token_stream):
    >     stream = lowercase_token(token_stream)
    >     stream = remove_boring(stream)
    >     stream = remove_dupes(stream)
    >     for t in stream:
    >         yield t


    I also recommend the Beazley reference Herron gave.

    tjr
    Terry Reedy, May 7, 2009
    #5
  6. On May 8, 12:07 am, MRAB <> wrote:
    > def compound_filter(token_stream):
    >     stream = lowercase_token(token_stream)
    >     stream = remove_boring(stream)
    >     stream = remove_dupes(stream)
    >     for t in stream:
    >         yield t


    The last loop is superfluous. You can just do:

    def compound_filter(token_stream):
        stream = lowercase_token(token_stream)
        stream = remove_boring(stream)
        stream = remove_dupes(stream)
        return stream

    which is simpler and slightly more efficient. This works because from
    the caller's perspective, a generator is just a function that returns
    an iterator. It doesn't matter whether it implements the iterator
    itself by containing ``yield`` statements, or shamelessly passes on an
    iterator implemented elsewhere.
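
    As a side note (not from the thread), the same chaining pattern
    generalizes to any number of stages. A small helper sketch, assuming
    the filter functions defined earlier in the thread:

    import functools

    def chain_filters(stream, *filters):
        # Feed each filter stage the output of the previous one.
        return functools.reduce(lambda s, f: f(s), filters, stream)

    # tokens = chain_filters(token_stream,
    #                        lowercase_token, remove_boring, remove_dupes)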
    Beni Cherniavsky, May 17, 2009
    #6
