fast text processing


Alexis Gallagher

(I tried to post this yesterday but I think my ISP ate it. Apologies if
this is a double-post.)

Is it possible to do very fast string processing in python? My
bioinformatics application needs to scan very large ASCII files (80GB+),
compare adjacent lines, and conditionally do some further processing. I
believe the disk i/o is the main bottleneck so for now that's what I'm
optimizing. What I have now is roughly as follows (on python 2.3.5).

filehandle = open("data",'r',buffering=1000)

lastLine = filehandle.readline()

for currentLine in filehandle.readlines():

    lastTokens = lastLine.strip().split(delimiter)
    currentTokens = currentLine.strip().split(delimiter)

    lastGeno = extract(lastTokens[0])
    currentGeno = extract(currentTokens[0])

    # prepare for next iteration
    lastLine = currentLine

    if lastGeno == currentGeno:
        table.markEquivalent(int(lastTokens[1]),int(currentTokens[1]))

So on every iteration I'm processing mutable strings -- this seems
wrong. What's the best way to speed this up? Can I switch to some fast
byte-oriented immutable string library? Are there optimizing compilers?
Are there better ways to prep the file handle?

Perhaps this is a job for C, but I am of that soft generation which
fears memory management. I'd need to learn how to do buffered reading in
C, how to wrap the C in python, and how to let the C call back into
python to call markEquivalent(). It sounds painful. I _have_ done some
benchmark comparisons of only the underlying line-based file reading
against a Common Lisp version, but I doubt I'm using the optimal
construct in either language so I hesitate to trust my results, and
anyway the interlanguage bridge will be even more obscure in that case.

Much obliged for any help,
Alexis
 

Steve Holden

Alexis said:
(I tried to post this yesterday but I think my ISP ate it. Apologies if
this is a double-post.)

Is it possible to do very fast string processing in python? My
bioinformatics application needs to scan very large ASCII files (80GB+),
compare adjacent lines, and conditionally do some further processing. I
believe the disk i/o is the main bottleneck so for now that's what I'm
optimizing. What I have now is roughly as follows (on python 2.3.5).

filehandle = open("data",'r',buffering=1000)

This buffer size seems, shall we say, unadventurous? It's likely to slow
things down considerably, since the filesystem is probably going to
naturally want to use a rather larger value. I'd suggest a 64k minimum.
lastLine = filehandle.readline()
I'd suggest

lastTokens = filehandle.readline().strip().split(delimiter)

here. You have no need of the line other than to split it into tokens.
for currentLine in filehandle.readlines():
Note that this is going to read the whole file into (virtual) memory
before entering the loop. I somehow suspect you'd rather avoid this if
you could. I further suspect your testing has been with smaller files
than 80GB ;-). You might want to consider

for currentLine in filehandle:

as an alternative. This uses the file object's iterator protocol to
produce the next line each time round the loop.
lastTokens = lastLine.strip().split(delimiter)

The line above goes away if you adopt the loop initialization suggestion
above. Otherwise you are splitting each line twice, once as the current
line and then again as the last line.
currentTokens = currentLine.strip().split(delimiter)

lastGeno = extract(lastTokens[0])
currentGeno = extract(currentTokens[0])
If the extract() operation is stateless (in other words, if it always
produces the same output for a given input) then again you are
unnecessarily repeating yourself here: each line's first token is run
through extract() twice, once while it is the current line and again
once it has become the last line.

I might also observe that you seem to expect only two tokens per line.
If this is invariably the case then you might want to consider writing
an unpacking assignment instead, such as

cToken0, cToken1 = currentLine.strip().split(delimiter)

to save the indexing. Not a big deal, though, and it *will* break if you
have more than one delimiter in a line, as the interpreter won't then
know what to do with the third and subsequent elements.
# prepare for next iteration
lastLine = currentLine
Of course now you are going to strip and re-split the same line when you
loop around again. You should now just be able to say

lastTokens = currentTokens

instead.
if lastGeno == currentGeno:
    table.markEquivalent(int(lastTokens[1]),int(currentTokens[1]))

So on every iteration I'm processing mutable strings -- this seems
wrong. What's the best way to speed this up? Can I switch to some fast
byte-oriented immutable string library? Are there optimizing compilers?
Are there better ways to prep the file handle?
I'm sorry but I am not sure where the mutable strings come in. Python
strings are immutable anyway. Well-known for it. It might be a slight
problem that you are creating a second terminator-less copy of each
line, but that's probably inevitable.

Of course you leave us in the dark about the nature of
table.markEquivalent as well. Depending on the efficiency of the
extract() operation you might want to consider short-circuiting the loop
if the two tokens have already been marked as equivalent. That might be
a big win or not depending on relative efficiency.
Perhaps this is a job for C, but I am of that soft generation which
fears memory management. I'd need to learn how to do buffered reading in
C, how to wrap the C in python, and how to let the C call back into
python to call markEquivalent(). It sounds painful. I _have_ done some
benchmark comparisons of only the underlying line-based file reading
against a Common Lisp version, but I doubt I'm using the optimal
construct in either language so I hesitate to trust my results, and
anyway the interlanguage bridge will be even more obscure in that case.
Probably the biggest gain will be in simply not reading the whole file
into memory by calling its .readlines() method.

Summarising, consider something more like:

filehandle = open("data",'r',buffering=64*1024)
# You could also try just leaving the buffering spec out

lastTokens = filehandle.readline().strip().split(delim)
lastGeno = extract(lastTokens[0])

for currentLine in filehandle:

currentTokens = currentLine.strip().split(delim)
currentGeno = extract(currentTokens[0])
if lastGeno == currentGeno:
table.markEquivalent(int(lastTokens[1]), int(currentTokens[1]))

lastGeno = currentGeno
lastTokens = currentTokens

Hope this helps.

One last thought: the bioPython package will potentially save you huge
amounts of time. The guys who developed and maintain it have done a lot
of genome-slinging, and they appear to know what they are doing.

They may have already solved several problems you are struggling with.

regards
Steve
 

Ben Sizer

Maybe this code will be faster? (If it even does the same thing:
largely untested)


filehandle = open("data",'r',buffering=1000)
fileIter = iter(filehandle)

lastLine = fileIter.next()
lastTokens = lastLine.strip().split(delimiter)
lastGeno = extract(lastTokens[0])

for currentLine in fileIter:
    currentTokens = currentLine.strip().split(delimiter)
    currentGeno = extract(currentTokens[0])

    if lastGeno == currentGeno:
        table.markEquivalent(int(lastTokens[1]),int(currentTokens[1]))

    # prepare for next iteration
    lastLine = currentLine
    lastTokens = currentTokens
    lastGeno = currentGeno


I'd be tempted to try a bigger file buffer too, personally.
 

Alexis Gallagher

Steve,

First, many thanks!

Steve said:
This buffer size seems, shall we say, unadventurous? It's likely to slow
things down considerably, since the filesystem is probably going to
naturally want to use a rather larger value. I'd suggest a 64k minimum.

Good to know. I should have dug into the docs deeper. Somehow I thought
it listed lines not bytes.
Note that this is going to read the whole file in to (virtual) memory
before entering the loop. I somehow suspect you'd rather avoid this if
you could. I further suspect your testing has been with smaller files
than 80GB ;-). You might want to consider

Oops! Thanks again. I thought that readlines() was the generator form,
based on the docstring comments about the deprecation of xreadlines().
I'm sorry but I am not sure where the mutable strings come in. Python
strings are immutable anyway. Well-known for it.

I misspoke. I think I was mixing this up with the issue of object-creation
overhead for all of the string handling in general. Is this a bottleneck
to string processing in python, or is this a hangover from my Java days?
I would have thought that dumping the standard string processing
libraries in favor of byte manipulation would have been one of the
biggest wins.
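
A quick check I could run to see what the per-line string work costs by
itself (just a sketch; the sample line and tab delimiter below are made
up, not from my real data):

import timeit

# Rough check of the per-line string cost in isolation; the sample line
# and tab delimiter are invented for illustration.
sample = "AB12345\t42"
t = timeit.Timer("sample.strip().split('\\t')", "sample = %r" % sample)
perCall = t.timeit(100000) / 100000
print "%.2f microseconds per strip/split" % (perCall * 1e6)
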
Of course you leave us in the dark about the nature of
table.markEquivalent as well.

markEquivalent() implements union-join (aka uptrees) to generate
equivalence classes. Optimising that was going to be my next task.
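
For reference, the usual shape of that structure, with path compression
and union by rank (a generic sketch, not my actual table):

class EquivalenceTable:
    """Generic disjoint-set (uptree) sketch with path compression and
    union by rank -- not the real markEquivalent() implementation."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def find(self, x):
        # Create a singleton class the first time an id is seen.
        if x not in self.parent:
            self.parent[x] = x
            self.rank[x] = 0
            return x
        # Walk to the root, then flatten the path behind us.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def markEquivalent(self, a, b):
        rootA, rootB = self.find(a), self.find(b)
        if rootA == rootB:
            return          # already in the same class
        # Attach the shallower tree under the deeper one.
        if self.rank[rootA] < self.rank[rootB]:
            rootA, rootB = rootB, rootA
        self.parent[rootB] = rootA
        if self.rank[rootA] == self.rank[rootB]:
            self.rank[rootA] += 1

The existing table.markEquivalent(int(lastTokens[1]), int(currentTokens[1]))
call would work unchanged against something shaped like this.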

I feel a bit silly for missing the double-processing of everything.
Thanks for pointing that out. And I will check out the biopython package.

I'm still curious whether optimizing compilers are worth examining. For
instance, I saw Pyrex and Psyco mentioned on earlier threads. I'm
guessing that both the tokenizing and the uptree implementation are good
candidates for one of those tools, once I shake out these algorithmic
problems.
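
From what I've read the Psyco hook itself is tiny, something like this
(a sketch; scan() is just a stand-in for the real loop, and as I
understand it Psyco only helps CPU-bound pure-Python code on 32-bit x86,
not the disk reads):

def scan(filehandle, delimiter):
    # Stand-in for the real scanning loop, just to have something to bind.
    count = 0
    for line in filehandle:
        count += len(line.strip().split(delimiter))
    return count

try:
    import psyco
    psyco.bind(scan)        # JIT-compile just this function
except ImportError:
    pass                    # plain CPython if Psyco isn't installed
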


alexis
 

Larry Bates

When your problem is I/O bound, there is almost nothing that can be
done to speed it up without some sort of refactoring of the input
data itself. Python reads bytes off a hard drive just as fast as
any compiled language. A good test is to copy the file and measure
the time. You can't make your program run any faster than a copy
of the file itself without making hardware changes (e.g. RAID
arrays, etc.).
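
A bare read loop makes a decent yardstick for that (a sketch; the 64k
buffer is just a guess to match the earlier suggestion):

import time

# Time a read of the file with no per-line work at all; this is roughly
# the floor the full script can't beat without hardware changes.
start = time.time()
lineCount = 0
filehandle = open("data", 'r', buffering=64 * 1024)
for line in filehandle:
    lineCount += 1
filehandle.close()
print "%d lines in %.1f seconds" % (lineCount, time.time() - start)
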

You might also want to take a look at the csv module. Reading lines
and splitting on delimiters is almost always handled well by csv.
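
Something along these lines (a sketch; the tab delimiter is an
assumption, and the extract()/markEquivalent() steps from the original
script are left out):

import csv

# csv does the stripping and splitting for each row.
filehandle = open("data", "rb")
reader = csv.reader(filehandle, delimiter="\t")
lastRow = reader.next()
for currentRow in reader:
    if lastRow[0] == currentRow[0]:
        pass    # this is where markEquivalent() would be called
    lastRow = currentRow
filehandle.close()
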

-Larry Bates
 
