efficient data loading with Python, is that possible?


igor.tatarinov

Hi, I am pretty new to Python and trying to use it for a relatively
simple problem of loading a 5 million line text file and converting it
into a few binary files. The text file has a fixed format (like a
punchcard). The columns contain integer, real, and date values. The
output files are the same values in binary. I have to parse the values
and write the binary tuples out into the correct file based on a given
column. It's a little more involved but that's not important.

I have a C++ prototype of the parsing code and it loads a 5 Mline file
in about a minute. I was expecting the Python version to be 3-4 times
slower and I can live with that. Unfortunately, it's 20 times slower
and I don't see how I can fix that.

The fundamental difference is that in C++, I create a single object (a
line buffer) that's reused for each input line and column values are
extracted straight from that buffer without creating new string
objects. In Python, new objects must be created and destroyed by the million, which must incur serious memory management overhead.

Correct me if I am wrong but

1) for line in file: ...
will create a new string object for every input line

2) line[start:end]
will create a new string object as well

3) int(time.mktime(time.strptime(s, "%m%d%y%H%M%S")))
will create 10 objects (since struct_time has 8 fields)

4) a simple test: line[i:j] + line[m:n] in hash
creates 3 strings and there is no way to avoid that.

I thought arrays would help but I can't load an array without creating
a string first: ar(line, start, end) is not supported.

I hope I am missing something. I really like Python but if there is no
way to process data efficiently, that seems to be a problem.

Thanks,
igor
 

George Sakkis

Hi, I am pretty new to Python and trying to use it for a relatively
simple problem of loading a 5 million line text file and converting it
into a few binary files. The text file has a fixed format (like a
punchcard). The columns contain integer, real, and date values. The
output files are the same values in binary. I have to parse the values
and write the binary tuples out into the correct file based on a given
column. It's a little more involved but that's not important.

I have a C++ prototype of the parsing code and it loads a 5 Mline file
in about a minute. I was expecting the Python version to be 3-4 times
slower and I can live with that. Unfortunately, it's 20 times slower
and I don't see how I can fix that.

The fundamental difference is that in C++, I create a single object (a
line buffer) that's reused for each input line and column values are
extracted straight from that buffer without creating new string
objects. In Python, new objects must be created and destroyed by the million, which must incur serious memory management overhead.

Correct me if I am wrong but

1) for line in file: ...
will create a new string object for every input line

2) line[start:end]
will create a new string object as well

3) int(time.mktime(time.strptime(s, "%m%d%y%H%M%S")))
will create 10 objects (since struct_time has 8 fields)

4) a simple test: line[i:j] + line[m:n] in hash
creates 3 strings and there is no way to avoid that.

I thought arrays would help but I can't load an array without creating
a string first: ar(line, start, end) is not supported.

I hope I am missing something. I really like Python but if there is no
way to process data efficiently, that seems to be a problem.

20 times slower because of garbage collection sounds kinda fishy.
Posting some actual code usually helps; it's hard to tell for sure
otherwise.

George
 

John Machin

Hi, I am pretty new to Python and trying to use it for a relatively
simple problem of loading a 5 million line text file and converting it
into a few binary files. The text file has a fixed format (like a
punchcard). The columns contain integer, real, and date values. The
output files are the same values in binary. I have to parse the values
and write the binary tuples out into the correct file based on a given
column. It's a little more involved but that's not important.

I have a C++ prototype of the parsing code and it loads a 5 Mline file
in about a minute. I was expecting the Python version to be 3-4 times
slower and I can live with that. Unfortunately, it's 20 times slower
and I don't see how I can fix that.

The fundamental difference is that in C++, I create a single object (a
line buffer) that's reused for each input line and column values are
extracted straight from that buffer without creating new string
objects. In Python, new objects must be created and destroyed by the million, which must incur serious memory management overhead.

Don't stress out about it; the core devs have put in a few neat
optimisations in the last approx 17 years :)

I hope I am missing something.

You probably are: there is a multitude of possible reasons why newbie
code in any language runs slowly. Twenty minutes to process 5M lines
does seem excessive. However without seeing your code we can't help
much.

int(time.mktime(time.strptime(s, "%m%d%y%H%M%S")))

can be improved by looking up the time module for those two functions
once per run rather than twice per date field. Inside your function
[you are doing all this inside a function, not at global level in a
script, aren't you?], do this:
from time import mktime, strptime # do this ONCE
...
blahblah = int(mktime(strptime(s, "%m%d%y%H%M%S")))

It would help if you told us what platform, what version of Python,
how much memory, how much swap space, ...

Cheers,
John
 

igor.tatarinov

Inside your function
[you are doing all this inside a function, not at global level in a
script, aren't you?], do this:
from time import mktime, strptime # do this ONCE
...
blahblah = int(mktime(strptime(s, "%m%d%y%H%M%S")))

It would help if you told us what platform, what version of Python,
how much memory, how much swap space, ...

Cheers,
John

I am using a global 'from time import ...'. I will try to do that within the function and see if it makes a difference.

The computer I am using has 8G of RAM. It's a Linux dual-core AMD or
something like that. Python 2.4

Here is some of my code. Tell me what's wrong with it :)

def loadFile(inputFile, loader):
    # .zip files don't work with zlib
    f = popen('zcat ' + inputFile)
    for line in f:
        loader.handleLine(line)
    ...

In Loader class:

    def handleLine(self, line):
        # filter out 'wrong' lines
        if not self._dataFormat(line): return

        # add a new output record
        rec = self.result.addRecord()

        for col in self._dataFormat.colFormats:
            value = parseValue(line, col)
            rec[col.attr] = value

And here is parseValue (will using a hash-based dispatch make it much faster?):

def parseValue(line, col):
    s = line[col.start:col.end+1]
    # no switch in python
    if col.format == ColumnFormat.DATE:
        return Format.parseDate(s)
    if col.format == ColumnFormat.UNSIGNED:
        return Format.parseUnsigned(s)
    if col.format == ColumnFormat.STRING:
        # and-or trick (no x ? y:z in python 2.4)
        return not col.strip and s or rstrip(s)
    if col.format == ColumnFormat.BOOLEAN:
        return s == col.arg and 'Y' or 'N'
    if col.format == ColumnFormat.PRICE:
        return Format.parseUnsigned(s)/100.

And here is Format.parseDate() as an example:

def parseDate(s):
    # missing (infinite) value ?
    if s.startswith('999999') or s.startswith('000000'): return -1
    return int(mktime(strptime(s, "%y%m%d")))

Hopefully, this should be enough to tell what's wrong with my code.

Thanks again,
igor
 

John Machin

Inside your function
[you are doing all this inside a function, not at global level in a
script, aren't you?], do this:
from time import mktime, strptime # do this ONCE
...
blahblah = int(mktime(strptime(s, "%m%d%y%H%M%S")))
It would help if you told us what platform, what version of Python,
how much memory, how much swap space, ...
Cheers,
John

I am using a global 'from time import ...'. I will try to do that within the function and see if it makes a difference.

The computer I am using has 8G of RAM. It's a Linux dual-core AMD or
something like that. Python 2.4

Here is some of my code. Tell me what's wrong with it :)

def loadFile(inputFile, loader):
    # .zip files don't work with zlib
    f = popen('zcat ' + inputFile)
    for line in f:
        loader.handleLine(line)
    ...

In Loader class:

    def handleLine(self, line):
        # filter out 'wrong' lines
        if not self._dataFormat(line): return

        # add a new output record
        rec = self.result.addRecord()

        for col in self._dataFormat.colFormats:
            value = parseValue(line, col)
            rec[col.attr] = value

And here is parseValue (will using a hash-based dispatch make it much faster?):

def parseValue(line, col):
    s = line[col.start:col.end+1]
    # no switch in python
    if col.format == ColumnFormat.DATE:
        return Format.parseDate(s)
    if col.format == ColumnFormat.UNSIGNED:
        return Format.parseUnsigned(s)
    if col.format == ColumnFormat.STRING:
        # and-or trick (no x ? y:z in python 2.4)
        return not col.strip and s or rstrip(s)
    if col.format == ColumnFormat.BOOLEAN:
        return s == col.arg and 'Y' or 'N'
    if col.format == ColumnFormat.PRICE:
        return Format.parseUnsigned(s)/100.

And here is Format.parseDate() as an example:

def parseDate(s):
    # missing (infinite) value ?
    if s.startswith('999999') or s.startswith('000000'): return -1
    return int(mktime(strptime(s, "%y%m%d")))

Hopefully, this should be enough to tell what's wrong with my code.

I have to go out now, so here's a quick overview: too many goddam dots
and too many goddam method calls.
1. Do this (a fuller sketch follows below):
       colfmt = col.format  # ONCE
       if colfmt == ...
2. No switch, so put the most frequent format at the top.
3. What is ColumnFormat? What is Format? I think you have gone class-crazy, and there's more overhead than working code ...
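For example, here is a minimal sketch of what "fewer dots" might look like applied to handleLine. The names follow igor's snippets above, and it assumes parseValue is already defined when the class body runs; it is illustrative only, not a drop-in replacement:

    def handleLine(self, line, parseValue=parseValue):
        # bind the global function (via the default argument) and the
        # attribute to locals once per call, instead of looking them up
        # repeatedly inside the loop
        dataFormat = self._dataFormat
        if not dataFormat(line):
            return
        rec = self.result.addRecord()
        for col in dataFormat.colFormats:
            rec[col.attr] = parseValue(line, col)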

Cheers,
John
 

Steven D'Aprano

Hi, I am pretty new to Python and trying to use it for a relatively
simple problem of loading a 5 million line text file and converting it
into a few binary files. The text file has a fixed format (like a
punchcard). The columns contain integer, real, and date values. The
output files are the same values in binary. I have to parse the values
and write the binary tuples out into the correct file based on a given
column. It's a little more involved but that's not important.

I suspect that this actually is important, and that your slowdown has
everything to do with the stuff you dismiss and nothing to do with
Python's object model or execution speed.

I have a C++ prototype of the parsing code and it loads a 5 Mline file
in about a minute. I was expecting the Python version to be 3-4 times
slower and I can live with that. Unfortunately, it's 20 times slower and
I don't see how I can fix that.

I've run a quick test on my machine with a mere 1GB of RAM, reading the
entire file into memory at once, and then doing some quick processing on
each line:

# (The interpreter prompts were mangled by the archive; the two function
#  names below are reconstructions, but the bodies are as posted.)

def make_test_file(name, size):
    fp = open(name, 'w')
    for i in xrange(size):
        fp.write('here is a bunch of text with a newline\n')
    fp.close()

def read_test_file(name):
    import time
    start = time.time()
    fp = open(name, 'r')
    for line in fp.readlines():
        line = line.strip()
        words = line.split()
    fp.close()
    return time.time() - start

22.53150200843811

Twenty two seconds to read five million lines and split them into words.
I suggest the other nineteen minutes and forty-odd seconds your code is
taking has something to do with your code and not Python's execution
speed.

Of course, I wouldn't normally read all 5M lines into memory in one big
chunk. Replace the code

for line in fp.readlines():

with

for line in fp:

and the time drops from 22 seconds to 16.
 

DouhetSukd

Back about 8 yrs ago, on pc hardware, I was reading twin 5 Mb files
and doing a 'fancy' diff between the 2, in about 60 seconds. Granted,
your file is likely bigger, but so is modern hardware and 20 mins does
seem a bit high.

Can't talk about the rest of your code, but some parts of it may be
optimized

def parseValue(line, col):
    s = line[col.start:col.end+1]
    # no switch in python
    if col.format == ColumnFormat.DATE:
        return Format.parseDate(s)
    if col.format == ColumnFormat.UNSIGNED:
        return Format.parseUnsigned(s)

How about taking the big if clause out? That would require making all
the formatters into functions, rather than in-lining some of them, but
it may clean things up.

# Prebuild a lookup of functions vs. expected formats.
# This is done once.
# Remember, you have to position this dict's computation _after_ all
# the Format.parseXXX declarations. Don't worry, Python _will_ complain
# if you don't.

dict_format_func = {ColumnFormat.DATE: Format.parseDate,
                    ColumnFormat.UNSIGNED: Format.parseUnsigned,
                    ...
                    }

def parseValue(line, col):
    s = line[col.start:col.end+1]
    # get applicable function, apply it to s
    return dict_format_func[col.format](s)

Also...

if col.format == ColumnFormat.STRING:
    # and-or trick (no x ? y:z in python 2.4)
    return not col.strip and s or rstrip(s)

Watch out! 'col.strip' here is not the result of stripping the
column, it is the strip _function_ itself, bound to the col object, so
it will always be true. I get caught by those things all the time :-(

I agree that taking out the dot.dot.dots would help, but I wouldn't
expect it to matter that much, unless it was in an incredibly tight
loop.

It might be that.

if s.startswith('999999') or s.startswith('000000'): return -1

would be better as...

# Outside of the loop, define a set of values for which you want to return -1.
set_return = set(['999999', '000000'])

# Look up the first 6 chars in your set.
def parseDate(s):
    if s[0:6] in set_return:
        return -1
    return int(mktime(strptime(s, "%y%m%d")))

Bottom line: Python built-in data objects, such as dictionaries and
sets, are very much optimized. Relying on them, rather than writing a
lot of ifs and doing weird data structure manipulations in Python
itself, is a good approach to try. Try to build those objects outside
of your main processing loops.

Cheers

Douhet-did-suck
 

Steven D'Aprano

Here is some of my code. Tell me what's wrong with it :)

    def loadFile(inputFile, loader):
        # .zip files don't work with zlib

Pardon?

        f = popen('zcat ' + inputFile)
        for line in f:
            loader.handleLine(line)

Do you really need to compress the file? Five million lines isn't a lot.
It depends on the length of each line, naturally, but I'd be surprised if
it were more than 100MB.
...

In Loader class:

    def handleLine(self, line):
        # filter out 'wrong' lines
        if not self._dataFormat(line): return


Who knows what the _dataFormat() method does? How complicated is it? Why
is it a private method?

    # add a new output record
    rec = self.result.addRecord()

Who knows what this does? How complicated is it?

    for col in self._dataFormat.colFormats:

Hmmm... a moment ago, _dataFormat seemed to be a method, or at least a
callable. Now it has grown a colFormats attribute. Complicated and
confusing.

        value = parseValue(line, col)
        rec[col.attr] = value

And here is parseValue (will using a hash-based dispatch make it much
faster?):

Possibly, but not enough to reduce 20 minutes to one or two.

But you know something? Your code looks like a bad case of over-
generalisation. I assume it's a translation of your C++ code -- no wonder
it takes an entire minute to process the file! (Oh lord, did I just say
that???) Object-oriented programming is a useful tool, but sometimes you
don't need a HyperDispatcherLoaderManagerCreator, you just need a hammer.

In your earlier post, you gave the data specification:

"The text file has a fixed format (like a punchcard). The columns contain
integer, real, and date values. The output files are the same values in
binary."

Easy-peasy. First, some test data:


fp = open('BIG', 'w')
for i in xrange(5000000):
    anInt = i % 3000
    aBool = ['TRUE', 'YES', '1', 'Y', 'ON',
             'FALSE', 'NO', '0', 'N', 'OFF'][i % 10]
    aFloat = ['1.12', '-3.14', '0.0', '7.42'][i % 4]
    fp.write('%s %s %s\n' % (anInt, aBool, aFloat))
    if i % 45000 == 0:
        # Write a comment and a blank line.
        fp.write('# this is a comment\n \n')

fp.close()



Now let's process it:


import struct

# Define converters for each type of value to binary.
def fromBool(s):
    """String to boolean byte."""
    s = s.upper()
    if s in ('TRUE', 'YES', '1', 'Y', 'ON'):
        return struct.pack('b', True)
    elif s in ('FALSE', 'NO', '0', 'N', 'OFF'):
        return struct.pack('b', False)
    else:
        raise ValueError('not a valid boolean')

def fromInt(s):
    """String to integer bytes."""
    return struct.pack('l', int(s))

def fromFloat(s):
    """String to floating point bytes."""
    return struct.pack('f', float(s))


# Assume three fields...
DEFAULT_FORMAT = [fromInt, fromBool, fromFloat]

# And three files...
OUTPUT_FILES = ['ints.out', 'bools.out', 'floats.out']


def process_line(s, format=DEFAULT_FORMAT):
    s = s.strip()
    fields = s.split()  # I assume the fields are whitespace separated
    assert len(fields) == len(format)
    return [f(x) for (x, f) in zip(fields, format)]

def process_file(infile, outfiles=OUTPUT_FILES):
    out = [open(f, 'wb') for f in outfiles]
    for line in file(infile, 'r'):
        # ignore leading/trailing whitespace and comments
        line = line.strip()
        if line and not line.startswith('#'):
            fields = process_line(line)
            # now write the fields to the files
            for x, fp in zip(fields, out):
                fp.write(x)
    for f in out:
        f.close()



And now let's use it and see how long it takes:
129.58465385437012
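The timing call itself did not survive; it was presumably something roughly like the following, using the process_file function and the 'BIG' test file defined above:

    import time
    start = time.time()
    process_file('BIG')
    print time.time() - start    # reported 129.58465385437012 above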


Naturally if your converters are more complex (e.g. date-time), or if you
have more fields, it will take longer to process, but then I've made no
effort at all to optimize the code.
 

bearophileHUGS

igor:
The fundamental difference is that in C++, I create a single object (a
line buffer) that's reused for each input line and column values are
extracted straight from that buffer without creating new string
objects. In Python, new objects must be created and destroyed by the million, which must incur serious memory management overhead.

Python does indeed create many objects (as I think Tim once said, "it
allocates memory at a ferocious rate"), but its memory management is
quite efficient. You may also use the Psyco JIT (currently 1000 times
more useful than PyPy, despite sadly not being developed anymore),
which in some situations avoids copying data (for example, in slices).
Python is designed for string processing, and in my experience
string-processing Psyco programs can be faster than similar
not-optimized-to-death C++/D programs (you can see this with manually
crafted code, or with ShedSkin output, which is often slower than Psyco
for string processing). But in every language I know, to gain
performance you need to know the language, and Python isn't C++, so
other kinds of tricks are necessary.
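For what it's worth, enabling Psyco takes only a couple of lines; the sketch below assumes the psyco package is installed (it only supports 32-bit x86 builds of CPython 2.x):

    try:
        import psyco
        psyco.full()    # JIT-compile every function in the program
    except ImportError:
        pass            # Psyco not available; run on plain CPython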

The following advice is useful too:

DouhetSukd:
Bottom line: Python built-in data objects, such as dictionaries and
sets, are very much optimized. Relying on them, rather than writing a
lot of ifs and doing weird data structure manipulations in Python
itself, is a good approach to try. Try to build those objects outside
of your main processing loops.

Bye,
bearophile
 

Neil Cerutti

Inside your function
[you are doing all this inside a function, not at global level in a
script, aren't you?], do this:
from time import mktime, strptime # do this ONCE
...
blahblah = int(mktime(strptime(s, "%m%d%y%H%M%S")))

It would help if you told us what platform, what version of Python,
how much memory, how much swap space, ...

Cheers,
John

I am using a global 'from time import ...'. I will try to do that within the function and see if it makes a difference.

The computer I am using has 8G of RAM. It's a Linux dual-core AMD or
something like that. Python 2.4

Here is some of my code. Tell me what's wrong with it :)

def loadFile(inputFile, loader):
    # .zip files don't work with zlib
    f = popen('zcat ' + inputFile)
    for line in f:
        loader.handleLine(line)
    ...

In Loader class:

    def handleLine(self, line):
        # filter out 'wrong' lines
        if not self._dataFormat(line): return

        # add a new output record
        rec = self.result.addRecord()

        for col in self._dataFormat.colFormats:
            value = parseValue(line, col)
            rec[col.attr] = value

def parseValue(line, col):
    s = line[col.start:col.end+1]
    # no switch in python
    if col.format == ColumnFormat.DATE:
        return Format.parseDate(s)
    if col.format == ColumnFormat.UNSIGNED:
        return Format.parseUnsigned(s)
    if col.format == ColumnFormat.STRING:
        # and-or trick (no x ? y:z in python 2.4)
        return not col.strip and s or rstrip(s)
    if col.format == ColumnFormat.BOOLEAN:
        return s == col.arg and 'Y' or 'N'
    if col.format == ColumnFormat.PRICE:
        return Format.parseUnsigned(s)/100.

And here is Format.parseDate() as an example:

def parseDate(s):
    # missing (infinite) value ?
    if s.startswith('999999') or s.startswith('000000'): return -1
    return int(mktime(strptime(s, "%y%m%d")))

An inefficient parsing technique is probably to blame. You first
inspect the line to make sure it is valid, then you inspect it
(number of column types) times to discover what data type it
contains, and then you inspect it *again* to finally translate it.

And here is parseValue (will using a hash-based dispatch make
it much faster?):

Not much.

You should be able to validate, recognize and translate all in
one pass. Get pyparsing to help, if need be.
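Here is a rough sketch of the one-pass idea for a fixed-width line: build a list of (slice, converter) pairs once, outside the loop, and let the conversion itself do the validation. The slices and column layout below are made up for illustration; only the date handling follows igor's snippets:

    from time import mktime, strptime

    def to_date(s):
        # missing (infinite) value?
        if s[:6] in ('999999', '000000'):
            return -1
        return int(mktime(strptime(s, "%y%m%d")))

    # hypothetical layout: a date, an unsigned int, a string column
    FIELDS = [
        (slice(0, 6), to_date),
        (slice(6, 14), int),
        (slice(14, 20), str.rstrip),
    ]

    def parse_line(line, fields=FIELDS):
        try:
            return [conv(line[sl]) for sl, conv in fields]
        except ValueError:
            return None    # a 'wrong' line; validation falls out of the conversion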

What does your data look like?
 

Chris Gonnerman

Neil said:
An inefficient parsing technique is probably to blame. You first
inspect the line to make sure it is valid, then you inspect it
(number of column types) times to discover what data type it
contains, and then you inspect it *again* to finally translate
it.
I was thinking just that. It is much more "pythonic" to simply attempt
to convert the values in whatever fashion they are supposed to be
converted, and handle errors in data format by means of exceptions.
IMO, of course. In the "trivial" case, where there are no errors in the
data file, this is a heck of a lot faster.
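A minimal sketch of that EAFP style, just to make it concrete (the field positions and names are invented for the example):

    def handle_line(line):
        try:
            qty = int(line[0:8])
            price = int(line[8:16]) / 100.
        except ValueError:
            return None    # malformed field: treat it as a 'wrong' line
        return qty, price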

-- Chris.
 
