speeding up reading files (possibly with cython)

per

hi all,

i have a program that essentially loops through a text file that's
about 800 MB in size containing tab-separated data... my program
parses this file and stores its fields in a dictionary of lists.

for line in file:
    split_values = line.strip().split('\t')
    # do stuff with split_values

currently, this is very slow in python, even if all i do is break up
each line using split() and store its values in a dictionary, indexing
by one of the tab separated values in the file.

is this just an overhead of python that's inevitable? do you guys
think that switching to cython might speed this up, perhaps by
optimizing the main for loop? or is this not a viable option?

thank you.
 
skip

...

Why not use the csv module and specify TAB as your delimiter?

reader = csv.reader(open(fname, "rb"))
for row in reader:
    ...
 
skip

me> reader = csv.reader(open(fname, "rb"))
me> for row in reader:
me> ...

duh... How about

reader = csv.reader(open(fname, "rb"), delimiter='\t')
for row in reader:
    ...
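
For the dictionary-of-lists case you describe, a minimal sketch along those
lines, using collections.defaultdict -- which column is the key is just a
guess here:

import csv
from collections import defaultdict

table = defaultdict(list)
reader = csv.reader(open(fname, "rb"), delimiter='\t')
for row in reader:
    # hypothetical: key on the first column, keep the remaining fields
    table[row[0]].append(row[1:])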

S
 
Tim Chase

i have a program that essentially loops through a text file that's
about 800 MB in size containing tab-separated data... my program
parses this file and stores its fields in a dictionary of lists.

for line in file:
    split_values = line.strip().split('\t')
    # do stuff with split_values

currently, this is very slow in python, even if all i do is break up
each line using split() and store its values in a dictionary, indexing
by one of the tab separated values in the file.

I'm not sure what the situation is, but I regularly skim through
tab-delimited files of similar size and haven't noticed any
problems like you describe. You might try tweaking the optional
(and infrequently specified) bufsize parameter of the
open()/file() call:

bufsize = 4 * 1024 * 1024 # buffer 4 megs at a time
f = file('in.txt', 'r', bufsize)
for line in f:
    split_values = line.strip().split('\t')
    # do stuff with split_values

If not specified, you're at the mercy of the system default
(perhaps OS-specific?). You can read more at [1], along with the
associated warning about setvbuf().

-tkc


[1]
http://docs.python.org/library/functions.html#open
 
John Machin

hi all,

i have a program that essentially loops through a text file that's
about 800 MB in size containing tab-separated data... my program
parses this file and stores its fields in a dictionary of lists.

for line in file:
  split_values = line.strip().split('\t')

line.strip() is NOT a very good idea because it strips all whitespace
including tabs.

line.rstrip('\n') is sufficient.
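
A quick illustration with a made-up line that has empty trailing fields:

| >>> line = 'spam\t42\t\t\n'
| >>> line.strip().split('\t')
| ['spam', '42']
| >>> line.rstrip('\n').split('\t')
| ['spam', '42', '', '']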

BUT as Skip has pointed out, you should be using the csv module
anyway.

An 800MB file is unlikely to have been written by Excel. Excel has
this stupid idea of wrapping quotes around fields that contain commas
(and quotes) even when the field delimiter is NOT a comma.

Experiment: open Excel, enter the following 4 strings in cells A1:D1
normal
embedded,comma
"Hello"embedded-quote
normality returns
then save as Text (Tab-delimited).

Here's what you get:
| >>> open('Excel_tab_delimited.txt', 'rb').read()
| 'normal\t"embedded,comma"\t"""Hello""embedded-quote"\tnormality returns\r\n'
| >>>
  # do stuff with split_values

currently, this is very slow in python, even if all i do is break up
each line using split() and store its values in a dictionary, indexing
by one of the tab separated values in the file.

is this just an overhead of python that's inevitable? do you guys
think that switching to cython might speed this up, perhaps by
optimizing the main for loop?  or is this not a viable option?

You are unlikely to get much speed-up; I'd expect that the loop
overhead would be a tiny part of the execution time.
 
John Machin

hi all,

i have a program that essentially loops through a text file that's
about 800 MB in size containing tab-separated data... my program
parses this file and stores its fields in a dictionary of lists.

for line in file:
  split_values = line.strip().split('\t')

line.strip() will strip all leading/trailing whitespace *including*
*tabs*. Not a good idea. Use line.rstrip('\n') -- anything more is
losing data.
  # do stuff with split_values

currently, this is very slow in python, even if all i do is break up
each line using split() and store its values in a dictionary, indexing
by one of the tab separated values in the file.

is this just an overhead of python that's inevitable? do you guys
think that switching to cython might speed this up, perhaps by
optimizing the main for loop?  or is this not a viable option?

Not much point in using Cython IMO; loop overhead would be expected to
be a tiny part of the time.

Using the csv module is recommended. However, a *WARNING*:

When you save as "Text (Tab delimited)" Excel unnecessarily quotes
embedded commas and quotes.

csv.reader(..., delimiter='\t') acts like Excel reading back its own
output and thus is likely to mangle any quotes that are actually part
of the data, if the writer did not use the same "protocol".

An 800MB file is unlikely to have been created by Excel :) Presuming
your file was created using '\t'.join(list_of_strings) or equivalent,
you need to use csv.reader(..., delimiter='\t',
quoting=csv.QUOTE_NONE)

For example:
| >>> import csv
| >>> open('Excel_tab_delimited.txt', 'rb').read()
| 'normal\t"embedded,comma"\t"""Hello""embedded-quote"\r\n'
| >>> f = open('simple.tsv', 'wb')
| >>> f.write('normal\tembedded,comma\t"Hello"embedded-quote\r\n')
| >>> f.close()
| >>> list(csv.reader(open('Excel_tab_delimited.txt', 'rb'), delimiter='\t'))
| [['normal', 'embedded,comma', '"Hello"embedded-quote']]
| >>> list(csv.reader(open('simple.tsv', 'rb'), delimiter='\t'))
| [['normal', 'embedded,comma', 'Helloembedded-quote']]
| # Whoops!
| >>> list(csv.reader(open('simple.tsv', 'rb'), delimiter='\t', quoting=csv.QUOTE_NONE))
| [['normal', 'embedded,comma', '"Hello"embedded-quote']]
| # OK
| >>>

HTH,
John
 
Steven D'Aprano

per said:
hi all,

i have a program that essentially loops through a text file that's
about 800 MB in size containing tab-separated data... my program
parses this file and stores its fields in a dictionary of lists.

for line in file:
    split_values = line.strip().split('\t')
    # do stuff with split_values

currently, this is very slow in python, even if all i do is break up
each line using split() and store its values in a dictionary, indexing
by one of the tab separated values in the file.

is this just an overhead of python that's inevitable? do you guys
think that switching to cython might speed this up, perhaps by
optimizing the main for loop? or is this not a viable option?

Any time I see large data structures, I always think of memory consumption
and paging. How much memory do you have? My back-of-the-envelope estimate
is that you need at least 1.2 GB to store the 800MB of text, more if the
text is Unicode or if you're on a 64-bit system. If your computer only has
1GB of memory, it's going to be struggling; if it has 2GB, it might be a
little slow, especially if you're running other programs at the same time.

If that's the problem, the solution is: get more memory.

Apart from monitoring virtual memory use, another test you could do is to
see if the time taken to build the data structures scales approximately
linearly with the size of the data. That is, if it takes 2 seconds to read
80MB of data and store it in lists, then it should take around 4 seconds to
do 160MB and 20-30 seconds to do 800MB. If your results are linear, then
there's probably nothing much you can do to speed it up, since the time is
probably dominated by file I/O.

On the other hand, if the time scales worse than linear, there may be hope
to speed it up.
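
A rough sketch of such a scaling test, assuming 'in.txt' stands in for your
real file and that you key on the first column (adjust to match your actual
parsing):

import time

def build(path, max_lines):
    d = {}
    for i, line in enumerate(open(path)):
        if i >= max_lines:
            break
        split_values = line.rstrip('\n').split('\t')
        # hypothetical: key on the first column
        d.setdefault(split_values[0], []).append(split_values)
    return d

for n in (100000, 200000, 400000, 800000):
    t0 = time.time()
    build('in.txt', n)
    print "%d lines: %.1f s" % (n, time.time() - t0)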
 
Peter Otten

per said:
i have a program that essentially loops through a text file that's
about 800 MB in size containing tab-separated data... my program
parses this file and stores its fields in a dictionary of lists.

for line in file:
    split_values = line.strip().split('\t')
    # do stuff with split_values

currently, this is very slow in python, even if all i do is break up
each line using split() and store its values in a dictionary, indexing
by one of the tab separated values in the file.

is this just an overhead of python that's inevitable? do you guys
think that switching to cython might speed this up, perhaps by
optimizing the main for loop? or is this not a viable option?

For the general approach and the overall speed of your program it does
matter what you want to do with the data once you've read it -- can you
tell us a bit about that?

Peter
 
Carl Banks

hi all,

i have a program that essentially loops through a text file that's
about 800 MB in size containing tab-separated data... my program
parses this file and stores its fields in a dictionary of lists.

When building a very large structure like you're doing, the cyclic
garbage collector can be a bottleneck. Try disabling the cyclic
garbage collector before building the large dictionary, and re-enabling it afterwards.

import gc
gc.disable()
try:
    for line in file:
        split_values = line.strip().split('\t')
        # do stuff with split_values
finally:
    gc.enable()
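
A rough way to check whether the collector really is the bottleneck on your
machine is to time the build of a comparably sized structure with and
without it (synthetic data here, not your real file):

import gc, time

def build(n):
    d = {}
    for i in xrange(n):
        d[str(i)] = [str(i)] * 5   # stand-in for the parsed fields
    return d

for disabled in (False, True):
    if disabled:
        gc.disable()
    t0 = time.time()
    build(2000000)
    gc.enable()
    print "gc disabled=%s: %.1f s" % (disabled, time.time() - t0)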



Carl Banks
 
Timothy N. Tsvetkov

If that's the problem, the solution is: get more memory.

Or maybe think about an algorithm that needs less memory... My
experience tells me that whenever you want to store a lot of data in a
dict (or other structure) in order to analyze it, you can usually find a
way not to store so much data %)
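
For example, if (hypothetically) all you need in the end is a per-key
summary rather than every row, you can fold each line in as you read it and
never keep the rows at all:

totals = {}
for line in open('in.txt'):                # 'in.txt' is a placeholder
    fields = line.rstrip('\n').split('\t')
    key = fields[0]                        # hypothetical key column
    totals[key] = totals.get(key, 0) + 1   # e.g. just count rows per key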
 
Tim Chase

Steven said:
If that's the problem, the solution is: get more memory.

Steven caught the "and store its values in a dictionary" part (which I
missed previously and have emphasized in the quote above). Two
factors that haven't been mentioned yet:

1) how many *lines* are in this file (or what's the average
line-length). You can use the following code both to find out
how many lines are in the file, and to see how long it takes
Python to skim through an 800 meg file just in terms of file-I/O:

i = 0
for line in file('in.txt'):
    i += 1
print "%i lines" % i

2) how much overlap/commonality is there in the keys between
lines? Does every line create a new key, in which case you're
adding $LINES keys to your dictionary? or do some percentage of
lines overwrite entries in your dictionary with new values?
After one of your slow runs, issue a

print len(my_dict)

to see how many keys are in the final dict.

If you end up having millions of keys in your dict, you may be
able to use the "bsddb" module to store your dict on-disk and save
memory. Doing access to *two* files may not get you great wins
in speed, but you at least won't be thrashing your virtual memory
with a huge dict, so performance in the rest of your app may not
experience similar problems due to swapping into virtual memory.
This has the added advantage that, if your input file doesn't
change, you can simply reuse the bsddb database/dict file without
the need to rebuild its contents.
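
A minimal sketch of that idea using the standard-library shelve module (a
pickled, disk-backed dict; 'fields.db', 'in.txt' and the key column are
placeholders):

import shelve

db = shelve.open('fields.db')              # persisted on disk, reusable later
for line in open('in.txt'):
    split_values = line.rstrip('\n').split('\t')
    key = split_values[0]                  # hypothetical key column
    rows = db.get(key, [])                 # values are pickled, so read-modify-write
    rows.append(split_values[1:])
    db[key] = rows
db.close()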

-tkc
 
S Arrowsmith

Carl Banks said:
When building a very large structure like you're doing, the cyclic
garbage collector can be a bottleneck. Try disabling the cyclic
garbage collector before building the large dictionary, and re-enabling it afterwards.

import gc
gc.disable()
try:
    for line in file:
        split_values = line.strip().split('\t')
        # do stuff with split_values
finally:
    gc.enable()

Completely untested, but if you find yourself doing that a lot,
might:

import gc
from contextlib import contextmanager

@contextmanager
def no_gc():
    gc.disable()
    try:
        yield
    finally:
        gc.enable()   # re-enable even if the loop raises

with no_gc():
    for line in file:
        # ... etc.

be worth considering?
 
