Best practice for operations on streams of text

James

Hello all,
I'm working on some NLP code - what I'm doing is passing a large
number of tokens through a number of filtering / processing steps.

The filters take a token as input, and may or may not yield a token as
a result. For example, I might chain together filters that lowercase
the input, filter out boring words, and filter out duplicates.

I originally had code like this:
for t0 in token_stream:
    for t1 in lowercase_token(t0):
        for t2 in remove_boring(t1):
            for t3 in remove_dupes(t2):
                yield t3

Apart from being ugly as sin, I only get one token out as
StopIteration is raised before the whole token stream is consumed.

Any suggestions on an elegant way to chain together a bunch of
generators, with processing steps in between?

Thanks,
James
 
J Kenneth King

Coroutines, my friends. Google will help you greatly in discovering
this processing wonder.
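
A minimal sketch of what a coroutine-based version of the pipeline might
look like, using generator-based coroutines driven by send(). The names
coroutine, drop_boring, drop_dupes and collect, and the stop-word set, are
made up for illustration and are not from the original post. Each stage
pushes tokens to the next stage instead of having them pulled:

```python
def coroutine(func):
    """Prime a generator-based coroutine so it is ready for send()."""
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)  # advance to the first 'yield'
        return gen
    return start

@coroutine
def lowercase(target):
    while True:
        token = yield
        target.send(token.lower())

@coroutine
def drop_boring(boring, target):
    while True:
        token = yield
        if token not in boring:  # boring tokens are simply not forwarded
            target.send(token)

@coroutine
def drop_dupes(target):
    seen = set()
    while True:
        token = yield
        if token not in seen:
            seen.add(token)
            target.send(token)

@coroutine
def collect(results):
    # Sink: append every token that reaches the end of the pipeline.
    while True:
        results.append((yield))

results = []
# Hypothetical stop-word set {"the", "a"}; stages are wired sink-first.
pipeline = lowercase(drop_boring({"the", "a"}, drop_dupes(collect(results))))
for token in ["The", "Cat", "the", "cat", "sat"]:
    pipeline.send(token)
print(results)
```

The trade-off is that the data flow is inverted: the caller pushes tokens
in, rather than iterating over a result.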
 
Gary Herron

David Beazley has a very interesting talk on using generators for
building and linking together individual stream filters. It's very cool
and surprisingly eye-opening.

See "Generator Tricks for Systems Programmers" at
http://www.dabeaz.com/generators/

Gary Herron
 
MRAB

What you should be doing is letting the filters accept an iterator and
yield values on demand:

def lowercase_token(stream):
    for t in stream:
        yield t.lower()

def remove_boring(stream):
    # 'boring' is assumed to be a collection of stop words defined elsewhere
    for t in stream:
        if t not in boring:
            yield t

def remove_dupes(stream):
    seen = set()  # tokens already passed downstream
    for t in stream:
        if t not in seen:
            yield t
            seen.add(t)

def compound_filter(token_stream):
    # Each stage wraps the previous one; nothing runs until iteration starts
    stream = lowercase_token(token_stream)
    stream = remove_boring(stream)
    stream = remove_dupes(stream)
    for t in stream:
        yield t
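
To see the chained filters in action, here is a small self-contained run.
The stop-word set BORING and the sample sentence are invented for the
example, and remove_boring takes the set as a default argument purely so
the snippet stands alone:

```python
BORING = frozenset({"the", "a", "of"})  # hypothetical stop-word set

def lowercase_token(stream):
    for t in stream:
        yield t.lower()

def remove_boring(stream, boring=BORING):
    for t in stream:
        if t not in boring:
            yield t

def remove_dupes(stream):
    seen = set()
    for t in stream:
        if t not in seen:
            seen.add(t)
            yield t

def compound_filter(token_stream):
    stream = lowercase_token(token_stream)
    stream = remove_boring(stream)
    stream = remove_dupes(stream)
    for t in stream:
        yield t

tokens = "The cat sat on the mat of THE Cat".split()
result = list(compound_filter(tokens))
print(result)  # ['cat', 'sat', 'on', 'mat']
```

Any iterable works as the input, so the same function can be fed a list, a
file object, or another generator.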
 
Terry Reedy

For the nested-loop version to work at all, the three functions would
have to turn each token into an iterable of 0 or 1 tokens. Hence the
inner 'loops' would execute 0 or 1 times. Better to return a token or
None, and replace the three inner 'loops' with three conditional
statements (ugly too) or, less efficiently (due to lack of short
circuiting),

t = remove_dupes(remove_boring(lowercase_token(t0)))
if t is not None: yield t
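
One possible reading of the token-or-None approach, as a sketch. The
function bodies and the BORING set are guesses for illustration, not the
original poster's code. Note that every filter after the first has to
accept None and pass it along, which is the missing short-circuiting
mentioned above:

```python
BORING = {"the", "a"}  # hypothetical stop-word set

def lowercase_token(token):
    return token.lower()

def remove_boring(token):
    # Must tolerate None in case an earlier filter dropped the token.
    return None if token is None or token in BORING else token

def filter_tokens(token_stream):
    seen = set()

    def remove_dupes(token):
        # Still called for tokens that were already dropped (token is None).
        if token is None or token in seen:
            return None
        seen.add(token)
        return token

    for t0 in token_stream:
        t = remove_dupes(remove_boring(lowercase_token(t0)))
        if t is not None:
            yield t

result = list(filter_tokens(["The", "Cat", "cat", "sat"]))
print(result)  # ['cat', 'sat']
```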

The StopIteration after one token puzzles me. Your actual code must be
slightly different from the above and what I imagine the functions to
be. But never mind, because MRAB's suggestion is the way to go. You
automatically get short-circuiting because each generator only gets
what is passed on. And resuming a generator is much faster than
re-calling a function.
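
The short-circuiting can be made visible with a small experiment. The tap
helper and the two log lists are invented for this sketch: tap records
every token that flows through the point where it is inserted.

```python
from itertools import islice

def tap(stream, log):
    # Pass tokens through unchanged, recording each one that gets this far.
    for t in stream:
        log.append(t)
        yield t

def lowercase_token(stream):
    for t in stream:
        yield t.lower()

def remove_boring(stream, boring):
    for t in stream:
        if t not in boring:
            yield t

pulled, reached = [], []
source = tap(["The", "Cat", "Sat", "On", "Mat"], pulled)
pipeline = tap(remove_boring(lowercase_token(source), {"the"}), reached)

first_two = list(islice(pipeline, 2))
print(first_two)  # ['cat', 'sat']
print(pulled)     # ['The', 'Cat', 'Sat']
print(reached)    # ['cat', 'sat']
```

The dropped token "the" never reaches the stage after remove_boring, and
the source is only read as far as needed to produce the two requested
tokens.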

I also recommend the Beazley reference Herron gave.

tjr
 
Beni Cherniavsky

def compound_filter(token_stream):
     stream = lowercase_token(token_stream)
     stream = remove_boring(stream)
     stream = remove_dupes(stream)
     for t in stream:
         yield t

The last loop is superfluous. You can just do::

def compound_filter(token_stream):
     stream = lowercase_token(token_stream)
     stream = remove_boring(stream)
     stream = remove_dupes(stream)
     return stream

which is simpler and slightly more efficient. This works because from
the caller's perspective, a generator is just a function that returns
an iterator. It doesn't matter whether it implements the iterator
itself by containing ``yield`` statements, or shamelessly passes on an
iterator implemented elsewhere.
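
Taking this one step further, the chaining itself can be factored out. A
rough sketch, where chain_filters is a made-up helper name and each filter
is assumed to take an iterator and return an iterator:

```python
from functools import reduce

def chain_filters(token_stream, *filters):
    # Feed the stream through each filter in order and return the final
    # iterator; no wrapper loop is needed.
    return reduce(lambda stream, f: f(stream), filters, token_stream)

def lowercase_token(stream):
    for t in stream:
        yield t.lower()

def remove_dupes(stream):
    seen = set()
    for t in stream:
        if t not in seen:
            seen.add(t)
            yield t

result = list(chain_filters(["A", "a", "B"], lowercase_token, remove_dupes))
print(result)  # ['a', 'b']
```

One small difference from the ``yield`` version: a plain function that
returns the stream builds the chain as soon as it is called, whereas a
generator function does not run any of its body until the first item is
requested.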
 
