fast text processing


Alexis Gallagher

(I tried to post this yesterday but I think my ISP ate it. Apologies if
this is a double-post.)

Is it possible to do very fast string processing in python? My
bioinformatics application needs to scan very large ASCII files (80GB+),
compare adjacent lines, and conditionally do some further processing. I
believe the disk i/o is the main bottleneck so for now that's what I'm
optimizing. What I have now is roughly as follows (on python 2.3.5).

filehandle = open("data",'r',buffering=1000)

lastLine = filehandle.readline()

for currentLine in filehandle.readlines():

    lastTokens = lastLine.strip().split(delimiter)
    currentTokens = currentLine.strip().split(delimiter)

    lastGeno = extract(lastTokens[0])
    currentGeno = extract(currentTokens[0])

    # prepare for next iteration
    lastLine = currentLine

    if lastGeno == currentGeno:
        table.markEquivalent(int(lastTokens[1]),int(currentTokens[1]))

So on every iteration I'm processing mutable strings -- this seems
wrong. What's the best way to speed this up? Can I switch to some fast
byte-oriented immutable string library? Are there optimizing compilers?
Are there better ways to prep the file handle?

Perhaps this is a job for C, but I am of that soft generation which
fears memory management. I'd need to learn how to do buffered reading in
C, how to wrap the C in python, and how to let the C call back into
python to call markEquivalent(). It sounds painful. I _have_ done some
benchmark comparisons of only the underlying line-based file reading
against a Common Lisp version, but I doubt I'm using the optimal
construct in either language so I hesitate to trust my results, and
anyway the interlanguage bridge will be even more obscure in that case.

Much obliged for any help,
Alexis
 

Steve Holden

Alexis said:
(I tried to post this yesterday but I think my ISP ate it. Apologies if
this is a double-post.)

Is it possible to do very fast string processing in python? My
bioinformatics application needs to scan very large ASCII files (80GB+),
compare adjacent lines, and conditionally do some further processing. I
believe the disk i/o is the main bottleneck so for now that's what I'm
optimizing. What I have now is roughly as follows (on python 2.3.5).

filehandle = open("data",'r',buffering=1000)

This buffer size seems, shall we say, unadventurous? It's likely to slow
things down considerably, since the filesystem is probably going to
naturally want to use a rather larger value. I'd suggest a 64k minimum.
lastLine = filehandle.readline()
I'd suggest

lastTokens = filehandle.readline().strip().split(delimiter)

here. You have no need of the line other than to split it into tokens.
for currentLine in filehandle.readlines():
Note that this is going to read the whole file into (virtual) memory
before entering the loop. I somehow suspect you'd rather avoid this if
you could. I further suspect your testing has been with smaller files
than 80GB ;-). You might want to consider

for currentLine in filehandle:

as an alternative. This uses the file object's iterator protocol to
produce the next line each time round the loop.
lastTokens = lastLine.strip().split(delimiter)

The line above goes away if you adopt the loop initialization suggestion
above. Otherwise you are splitting each line twice, once as the current
line and then again as the last line.
currentTokens = currentLine.strip().split(delimiter)

lastGeno = extract(lastTokens[0])
currentGeno = extract(currentTokens[0])
If the extract() operation is stateless (in other words, if it always
produces the same output for a given input) then again you are
unnecessarily repeating yourself here: each line's first token is run
through extract() twice, once while it is the current line and again
once it has become the last line.

I might also observe that you seem to expect only two tokens per line.
If this is invariably the case then you might want to consider writing
an unpacking assignment instead, such as

cToken0, cToken1 = currentLine.strip().split(delimiter)

to save the indexing. Not a big deal, though, and it *will* break if you
have more than one delimiter in a line, as the interpreter won't then
know what to do with the third and subsequent elements.
# prepare for next iteration
lastLine = currentLine
Of course now you are going to strip and re-split the same line when you
loop around again. You should now just be able to say

lastTokens = currentTokens

instead.
if lastGeno == currentGeno:
    table.markEquivalent(int(lastTokens[1]),int(currentTokens[1]))

So on every iteration I'm processing mutable strings -- this seems
wrong. What's the best way to speed this up? Can I switch to some fast
byte-oriented immutable string library? Are there optimizing compilers?
Are there better ways to prep the file handle?
I'm sorry but I am not sure where the mutable strings come in. Python
strings are immutable anyway. Well-known for it. It might be a slight
problem that you are creating a second terminator-less copy of each
line, but that's probably inevitable.

Of course you leave us in the dark about the nature of
table.markEquivalent as well. Depending on the efficiency of the
extract() operation you might want to consider short-circuiting the loop
if the two tokens have already been marked as equivalent. That might be
a big win or not depending on relative efficiency.
Perhaps this is a job for C, but I am of that soft generation which
fears memory management. I'd need to learn how to do buffered reading in
C, how to wrap the C in python, and how to let the C call back into
python to call markEquivalent(). It sounds painful. I _have_ done some
benchmark comparisons of only the underlying line-based file reading
against a Common Lisp version, but I doubt I'm using the optimal
construct in either language so I hesitate to trust my results, and
anyway the interlanguage bridge will be even more obscure in that case.
Probably the biggest gain will be in simply not reading the whole file
into memory by calling its .readlines() method.

Summarising, consider something more like:

filehandle = open("data",'r',buffering=64*1024)
# You could also try just leaving the buffering spec out

lastTokens = filehandle.readline().strip().split(delim)
lastGeno = extract(lastTokens[0])

for currentLine in filehandle:

currentTokens = currentLine.strip().split(delim)
currentGeno = extract(currentTokens[0])
if lastGeno == currentGeno:
table.markEquivalent(int(lastTokens[1]), int(currentTokens[1]))

lastGeno = currentGeno
lastTokens = currentTokens

Hope this helps.

One last thought: the bioPython package will potentially save you huge
amounts of time. The guys who developed and maintain it have done a lot
of genome-slinging, and they appear to know what they are doing.

They may have already solved several problems you are struggling with.

regards
Steve
 

Ben Sizer

Maybe this code will be faster? (If it even does the same thing:
largely untested)


filehandle = open("data",'r',buffering=1000)
fileIter = iter(filehandle)

lastLine = fileIter.next()
lastTokens = lastLine.strip().split(delimiter)
lastGeno = extract(lastTokens[0])

for currentLine in fileIter:
    currentTokens = currentLine.strip().split(delimiter)
    currentGeno = extract(currentTokens[0])

    if lastGeno == currentGeno:
        table.markEquivalent(int(lastTokens[1]),int(currentTokens[1]))

    # prepare for next iteration
    lastLine = currentLine
    lastTokens = currentTokens
    lastGeno = currentGeno


I'd be tempted to try a bigger file buffer too, personally.
 

Alexis Gallagher

Steve,

First, many thanks!

Steve said:
This buffer size seems, shall we say, unadventurous? It's likely to slow
things down considerably, since the filesystem is probably going to
naturally want to use a rather larger value. I'd suggest a 64k minimum.

Good to know. I should have dug into the docs deeper. Somehow I thought
it listed lines not bytes.
Note that this is going to read the whole file in to (virtual) memory
before entering the loop. I somehow suspect you'd rather avoid this if
you could. I further suspect your testing has been with smaller files
than 80GB ;-). You might want to consider

Oops! Thanks again. I thought that readlines() was the generator form,
based on the docstring comments about the deprecation of xreadlines().
I'm sorry but I am not sure where the mutable strings come in. Python
strings are immutable anyway. Well-known for it.

I misspoke. I think I was mixing this up with the issue of object-creation
overhead for all of the string handling in general. Is this a bottleneck
to string processing in python, or is this a hangover from my Java days?
I would have thought that dumping the standard string processing
libraries in favor of byte manipulation would have been one of the
biggest wins.
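
A quick check I could run to see what the per-line string work costs by
itself (just a sketch; the sample line and tab delimiter below are made
up, not from my real data):

import timeit

# Rough check of the per-line string cost in isolation; the sample line
# and tab delimiter are invented for illustration.
sample = "AB12345\t42"
t = timeit.Timer("sample.strip().split('\\t')", "sample = %r" % sample)
perCall = t.timeit(100000) / 100000
print "%.2f microseconds per strip/split" % (perCall * 1e6)
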
Of course you leave us in the dark about the nature of
table.markEquivalent as well.

markEquivalent() implements union-join (aka uptrees) to generate
equivalence classes. Optimising that was going to be my next task.
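
For reference, the usual shape of that structure, with path compression
and union by rank (a generic sketch, not my actual table):

class EquivalenceTable:
    """Generic disjoint-set (uptree) sketch with path compression and
    union by rank -- not the real markEquivalent() implementation."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def find(self, x):
        # Create a singleton class the first time an id is seen.
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0
            return x
        # Walk to the root, then flatten the path behind us.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def markEquivalent(self, a, b):
        rootA, rootB = self.find(a), self.find(b)
        if rootA == rootB:
            return          # already in the same class
        # Attach the shallower tree under the deeper one.
        if self.rank[rootA] < self.rank[rootB]:
            rootA, rootB = rootB, rootA
        self.parent[rootB] = rootA
        if self.rank[rootA] == self.rank[rootB]:
            self.rank[rootA] += 1

The existing table.markEquivalent(int(lastTokens[1]), int(currentTokens[1]))
call would work unchanged against something shaped like this.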

I feel a bit silly for missing the double-processing of everything.
Thanks for pointing that out. And I will check out the biopython package.

I'm still curious whether optimizing compilers are worth examining. For
instance, I saw Pyrex and Psyco mentioned on earlier threads. I'm
guessing that both the tokenizing and the uptree implementation are good
candidates for one of those tools, once I shake out these algorithmic
problems.
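
From what I've read the Psyco hook itself is tiny, something like this
(a sketch; scan() is just a stand-in for the real loop, and as I
understand it Psyco only helps CPU-bound pure-Python code on 32-bit x86,
not the disk reads):

def scan(filehandle, delimiter):
    # Stand-in for the real scanning loop, just to have something to bind.
    count = 0
    for line in filehandle:
        count += len(line.strip().split(delimiter))
    return count

try:
    import psyco
    psyco.bind(scan)        # JIT-compile just this function
except ImportError:
    pass                    # plain CPython if Psyco isn't installed
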


alexis
 

Larry Bates

When your problem is I/O bound, there is almost nothing that can be
done to speed it up without some sort of refactoring of the input
data itself. Python reads bytes off a hard drive just as fast as
any compiled language. A good test is to copy the file and measure
the time. You can't make your program run any faster than a copy
of the file itself without making hardware changes (e.g. RAID
arrays, etc.).
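
A bare read loop makes a decent yardstick for that (a sketch; the 64k
buffer is just a guess to match the earlier suggestion):

import time

# Time a read of the file with no per-line work at all; this is roughly
# the floor the full script can't beat without hardware changes.
start = time.time()
lineCount = 0
filehandle = open("data", 'r', buffering=64 * 1024)
for line in filehandle:
    lineCount += 1
filehandle.close()
print "%d lines in %.1f seconds" % (lineCount, time.time() - start)
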

You might also want to take a look at the csv module. Reading lines
and splitting on delimiters is almost always handled well by csv.
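
Something along these lines (a sketch; the tab delimiter is an
assumption, and the extract()/markEquivalent() steps from the original
script are left out):

import csv

# csv does the stripping and splitting for each row.
filehandle = open("data", "rb")
reader = csv.reader(filehandle, delimiter="\t")
lastRow = reader.next()
for currentRow in reader:
    if lastRow[0] == currentRow[0]:
        pass    # this is where markEquivalent() would be called
    lastRow = currentRow
filehandle.close()
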

-Larry Bates
 
