Fast File Input

  • Thread starter Scott Brady Drummonds

Scott Brady Drummonds

Hi, everyone,

I'm a relative novice to Python and am trying to reduce the processing time
for a very large text file that I am reading into my Python script. I'm
currently reading each line one at a time (readline()), stripping the
leading and trailing whitespace (strip()) and splitting its delimited data
(split()). For my large input files, this text processing is taking many
hours.

If I were working in C, I'd consider using a lower level I/O library,
minimizing text processing, and reducing memory redundancy. However, I have
no idea at all what to do to optimize this process in Python.

Can anyone offer some suggestions?

Thanks,
Scott
 
Pádraig

Scott said:
Hi, everyone,

I'm a relative novice to Python and am trying to reduce the processing time
for a very large text file that I am reading into my Python script. I'm
currently reading each line one at a time (readline()), stripping the
leading and trailing whitespace (strip()) and splitting its delimited data
(split()). For my large input files, this text processing is taking many
hours.

If I were working in C, I'd consider using a lower level I/O library,
minimizing text processing, and reducing memory redundancy. However, I have
no idea at all what to do to optimize this process in Python.

Can anyone offer some suggestions?

This actually improved a lot with Python version 2,
but it is still quite slow, as you can see here:
http://www.pixelbeat.org/readline/
There are a few notes within the python script there.

Pádraig.
 

Terry Reedy

Scott Brady Drummonds said:
Hi, everyone,

I'm a relative novice to Python and am trying to reduce the processing time
for a very large text file that I am reading into my Python script. I'm
currently reading each line one at a time (readline()), stripping the
leading and trailing whitespace (strip()) and splitting its delimited data
(split()). For my large input files, this text processing is taking many
hours.

for line in file('somefile.txt'): ...
will be faster because the file iterator reads a much larger block with
each disk access.

Do you really need strip()? Clipping \n off the last item after split()
*might* be faster.
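
A minimal sketch of both of these suggestions, written in modern Python 3 syntax (the sample lines and the comma delimiter are made up for illustration):

```python
# Whitespace-delimited data: split() with no arguments splits on any
# run of whitespace and discards the trailing newline on its own, so
# no separate strip() call is needed.
line = "alpha beta gamma\n"
fields = line.split()
print(fields)            # ['alpha', 'beta', 'gamma']

# Explicitly delimited data: clip the newline off the last field only,
# instead of stripping the whole line first.
csv_line = "alpha,beta,gamma\n"
parts = csv_line.split(",")
parts[-1] = parts[-1].rstrip("\n")
print(parts)             # ['alpha', 'beta', 'gamma']
```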

Terry J. Reedy
 

Andrei

Scott Brady Drummonds wrote on Wed, 25 Feb 2004 08:35:43 -0800:
Hi, everyone,

I'm a relative novice to Python and am trying to reduce the processing time
for a very large text file that I am reading into my Python script. I'm
currently reading each line one at a time (readline()), stripping the
leading and trailing whitespace (strip()) and splitting its delimited data
(split()). For my large input files, this text processing is taking many
hours.

An easy improvement is using "for line in sometextfile:" instead of
repetitive readline(). Not sure how much time this will save you (depends
on what you're doing after reading), but it can make a difference at
virtually no cost. You might also want to try rstrip() instead of strip()
(not sure if it's faster, but perhaps it is).
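
A minimal sketch of that pattern in modern Python 3 syntax, with io.StringIO standing in for a real text file (the sample data is made up):

```python
import io

# io.StringIO stands in for a real on-disk text file here.
fake_file = io.StringIO("one 1\ntwo 2\nthree 3\n")

records = []
for line in fake_file:                    # buffered iteration, no readline() calls
    fields = line.rstrip("\n").split()    # rstrip only the trailing newline
    records.append(fields)

print(records)   # [['one', '1'], ['two', '2'], ['three', '3']]
```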

--
Yours,

Andrei

 

Skip Montanaro

Pádraig> This actually improved a lot with python version 2
Pádraig> but is still quite slow as you can see here:
Pádraig> http://www.pixelbeat.org/readline/
Pádraig> There are a few notes within the python script there.

Your page doesn't mention precisely which version of Python 2 you used. I
suspect a rather old one (2.0? 2.1?) because of the style of loop you used
to read from sys.stdin. Eliminating comments, your python2 script was:

import sys

while 1:
    line = sys.stdin.readline()
    if line == '':
        break
    try:
        print line,
    except:
        pass

Running that using the CVS version of Python feeding it my machine's
dictionary as input I got this time(1) output (fastest real time of four runs):

% time python readltst.py < /usr/share/dict/words > /dev/null

real 0m1.384s
user 0m1.290s
sys 0m0.060s

Rewriting it to eliminate the try/except statement (why did you have that
there?) got it to:

% time python readltst.py < /usr/share/dict/words > /dev/null

real 0m1.373s
user 0m1.270s
sys 0m0.040s

Further rewriting it as the more modern:

import sys

for line in sys.stdin:
    print line,

yielded:

% time python readltst2.py < /usr/share/dict/words > /dev/null

real 0m0.660s
user 0m0.600s
sys 0m0.060s

My guess is that your python2 times are probably at least a factor of 2 too
large if you accept that people will use a recent version of Python in which
file objects are iterators.
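
A rough way to reproduce this comparison yourself, in modern Python 3 syntax, using an in-memory file rather than a real dictionary file (the data and repetition counts are made up, and absolute timings will vary by machine):

```python
import timeit

# Shared setup: 100,000 short lines in an in-memory text file.
setup = "import io\ndata = 'word\\n' * 100000"

# Old style: explicit readline() loop with an end-of-file test.
readline_loop = """
f = io.StringIO(data)
while 1:
    line = f.readline()
    if line == '':
        break
"""

# Modern style: iterate the file object directly.
iterator_loop = """
f = io.StringIO(data)
for line in f:
    pass
"""

t_readline = timeit.timeit(readline_loop, setup=setup, number=10)
t_iterator = timeit.timeit(iterator_loop, setup=setup, number=10)
print("readline loop: %.3fs  iterator: %.3fs" % (t_readline, t_iterator))
```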

Skip
 
