efficient data loading with Python, is that possible?

Discussion in 'Python' started by igor.tatarinov@gmail.com, Dec 12, 2007.

  1. Guest

    Hi, I am pretty new to Python and trying to use it for a relatively
    simple problem of loading a 5 million line text file and converting it
    into a few binary files. The text file has a fixed format (like a
    punchcard). The columns contain integer, real, and date values. The
    output files are the same values in binary. I have to parse the values
    and write the binary tuples out into the correct file based on a given
    column. It's a little more involved but that's not important.

    I have a C++ prototype of the parsing code and it loads a 5 Mline file
    in about a minute. I was expecting the Python version to be 3-4 times
    slower and I can live with that. Unfortunately, it's 20 times slower
    and I don't see how I can fix that.

    The fundamental difference is that in C++, I create a single object (a
    line buffer) that's reused for each input line and column values are
    extracted straight from that buffer without creating new string
    objects. In python, new objects must be created and destroyed by the
    million which must incur serious memory management overhead.

    Correct me if I am wrong but

    1) for line in file: ...
    will create a new string object for every input line

    2) line[start:end]
    will create a new string object as well

    3) int(time.mktime(time.strptime(s, "%m%d%y%H%M%S")))
    will create 10 objects (since struct_time has 8 fields)

    4) a simple test: line[i:j] + line[m:n] in hash
    creates 3 strings and there is no way to avoid that.

    I thought arrays would help but I can't load an array without creating
    a string first: ar(line, start, end) is not supported.

    I hope I am missing something. I really like Python but if there is no
    way to process data efficiently, that seems to be a problem.

    Thanks,
    igor
     
    , Dec 12, 2007
    #1

  2. On Dec 12, 5:48 pm, wrote:
    > Hi, I am pretty new to Python and trying to use it for a relatively
    > simple problem of loading a 5 million line text file and converting it
    > into a few binary files. The text file has a fixed format (like a
    > punchcard). The columns contain integer, real, and date values. The
    > output files are the same values in binary. I have to parse the values
    > and write the binary tuples out into the correct file based on a given
    > column. It's a little more involved but that's not important.
    >
    > I have a C++ prototype of the parsing code and it loads a 5 Mline file
    > in about a minute. I was expecting the Python version to be 3-4 times
    > slower and I can live with that. Unfortunately, it's 20 times slower
    > and I don't see how I can fix that.
    >
    > The fundamental difference is that in C++, I create a single object (a
    > line buffer) that's reused for each input line and column values are
    > extracted straight from that buffer without creating new string
    > objects. In python, new objects must be created and destroyed by the
    > million which must incur serious memory management overhead.
    >
    > Correct me if I am wrong but
    >
    > 1) for line in file: ...
    > will create a new string object for every input line
    >
    > 2) line[start:end]
    > will create a new string object as well
    >
    > 3) int(time.mktime(time.strptime(s, "%m%d%y%H%M%S")))
    > will create 10 objects (since struct_time has 8 fields)
    >
    > 4) a simple test: line[i:j] + line[m:n] in hash
    > creates 3 strings and there is no way to avoid that.
    >
    > I thought arrays would help but I can't load an array without creating
    > a string first: ar(line, start, end) is not supported.
    >
    > I hope I am missing something. I really like Python but if there is no
    > way to process data efficiently, that seems to be a problem.


    20 times slower because of garbage collection sounds kinda fishy.
    Posting some actual code usually helps; it's hard to tell for sure
    otherwise.

    George
     
    George Sakkis, Dec 12, 2007
    #2

  3. John Machin Guest

    On Dec 13, 9:48 am, wrote:
    > Hi, I am pretty new to Python and trying to use it for a relatively
    > simple problem of loading a 5 million line text file and converting it
    > into a few binary files. The text file has a fixed format (like a
    > punchcard). The columns contain integer, real, and date values. The
    > output files are the same values in binary. I have to parse the values
    > and write the binary tuples out into the correct file based on a given
    > column. It's a little more involved but that's not important.
    >
    > I have a C++ prototype of the parsing code and it loads a 5 Mline file
    > in about a minute. I was expecting the Python version to be 3-4 times
    > slower and I can live with that. Unfortunately, it's 20 times slower
    > and I don't see how I can fix that.
    >
    > The fundamental difference is that in C++, I create a single object (a
    > line buffer) that's reused for each input line and column values are
    > extracted straight from that buffer without creating new string
    > objects. In python, new objects must be created and destroyed by the
    > million which must incur serious memory management overhead.


    Don't stress out about it; the core devs have put in a few neat
    optimisations in the last approx 17 years :)

    > I hope I am missing something.


    You probably are: there is a multitude of possible reasons why newbie
    code in any language runs slowly. Twenty minutes to process 5M lines
    does seem excessive. However without seeing your code we can't help
    much.

    int(time.mktime(time.strptime(s, "%m%d%y%H%M%S")))

    can be improved by looking up the time module for those two functions
    once per run rather than twice per date field. Inside your function
    [you are doing all this inside a function, not at global level in a
    script, aren't you?], do this:
    from time import mktime, strptime # do this ONCE
    ...
    blahblah = int(mktime(strptime(s, "%m%d%y%H%M%S")))

    It would help if you told us what platform, what version of Python,
    how much memory, how much swap space, ...

    Cheers,
    John
     
    John Machin, Dec 13, 2007
    #3
  4. Guest

    On Dec 12, 4:03 pm, John Machin <> wrote:
    > Inside your function
    > [you are doing all this inside a function, not at global level in a
    > script, aren't you?], do this:
    > from time import mktime, strptime # do this ONCE
    > ...
    > blahblah = int(mktime(strptime(s, "%m%d%y%H%M%S")))
    >
    > It would help if you told us what platform, what version of Python,
    > how much memory, how much swap space, ...
    >
    > Cheers,
    > John


    I am using a global 'from time import ...'. I will try to do that
    within the function and see if it makes a difference.

    The computer I am using has 8G of RAM. It's a Linux dual-core AMD or
    something like that. Python 2.4

    Here is some of my code. Tell me what's wrong with it :)

    def loadFile(inputFile, loader):
        # .zip files don't work with zlib
        f = popen('zcat ' + inputFile)
        for line in f:
            loader.handleLine(line)
        ...

    In Loader class:
    def handleLine(self, line):
        # filter out 'wrong' lines
        if not self._dataFormat(line): return

        # add a new output record
        rec = self.result.addRecord()

        for col in self._dataFormat.colFormats:
            value = parseValue(line, col)
            rec[col.attr] = value

    And here is parseValue (will using a hash-based dispatch make it much
    faster?):

    def parseValue(line, col):
        s = line[col.start:col.end+1]
        # no switch in python
        if col.format == ColumnFormat.DATE:
            return Format.parseDate(s)
        if col.format == ColumnFormat.UNSIGNED:
            return Format.parseUnsigned(s)
        if col.format == ColumnFormat.STRING:
            # and-or trick (no x ? y:z in python 2.4)
            return not col.strip and s or rstrip(s)
        if col.format == ColumnFormat.BOOLEAN:
            return s == col.arg and 'Y' or 'N'
        if col.format == ColumnFormat.PRICE:
            return Format.parseUnsigned(s)/100.

    And here is Format.parseDate() as an example:
    def parseDate(s):
        # missing (infinite) value ?
        if s.startswith('999999') or s.startswith('000000'): return -1
        return int(mktime(strptime(s, "%y%m%d")))

    Hopefully, this should be enough to tell what's wrong with my code.

    Thanks again,
    igor
     
    , Dec 13, 2007
    #4
  5. John Machin Guest

    On Dec 13, 11:44 am, wrote:
    > On Dec 12, 4:03 pm, John Machin <> wrote:
    >
    > > Inside your function
    > > [you are doing all this inside a function, not at global level in a
    > > script, aren't you?], do this:
    > > from time import mktime, strptime # do this ONCE
    > > ...
    > > blahblah = int(mktime(strptime(s, "%m%d%y%H%M%S")))

    >
    > > It would help if you told us what platform, what version of Python,
    > > how much memory, how much swap space, ...

    >
    > > Cheers,
    > > John

    >
    > I am using a global 'from time import ...'. I will try to do that
    > within the
    > function and see if it makes a difference.
    >
    > The computer I am using has 8G of RAM. It's a Linux dual-core AMD or
    > something like that. Python 2.4
    >
    > Here is some of my code. Tell me what's wrong with it :)
    >
    > def loadFile(inputFile, loader):
    >     # .zip files don't work with zlib
    >     f = popen('zcat ' + inputFile)
    >     for line in f:
    >         loader.handleLine(line)
    >     ...
    >
    > In Loader class:
    > def handleLine(self, line):
    >     # filter out 'wrong' lines
    >     if not self._dataFormat(line): return
    >
    >     # add a new output record
    >     rec = self.result.addRecord()
    >
    >     for col in self._dataFormat.colFormats:
    >         value = parseValue(line, col)
    >         rec[col.attr] = value
    >
    > And here is parseValue (will using a hash-based dispatch make it much
    > faster?):
    >
    > def parseValue(line, col):
    >     s = line[col.start:col.end+1]
    >     # no switch in python
    >     if col.format == ColumnFormat.DATE:
    >         return Format.parseDate(s)
    >     if col.format == ColumnFormat.UNSIGNED:
    >         return Format.parseUnsigned(s)
    >     if col.format == ColumnFormat.STRING:
    >         # and-or trick (no x ? y:z in python 2.4)
    >         return not col.strip and s or rstrip(s)
    >     if col.format == ColumnFormat.BOOLEAN:
    >         return s == col.arg and 'Y' or 'N'
    >     if col.format == ColumnFormat.PRICE:
    >         return Format.parseUnsigned(s)/100.
    >
    > And here is Format.parseDate() as an example:
    > def parseDate(s):
    >     # missing (infinite) value ?
    >     if s.startswith('999999') or s.startswith('000000'): return -1
    >     return int(mktime(strptime(s, "%y%m%d")))
    >
    > Hopefully, this should be enough to tell what's wrong with my code.
    >


    I have to go out now, so here's a quick overview: too many goddam dots
    and too many goddam method calls.
    1. do
    colfmt = col.format # ONCE
    if colfmt == ...
    2. No switch so put most frequent at the top
    3. What is ColumnFormat? What is Format? I think you have gone class-
    crazy, and there's more overhead than working code ...
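
    To make (1) and (2) concrete, something like this (untested sketch,
    reusing your names; I'm only guessing that UNSIGNED is the most common
    column type):

    def parseValue(line, col):
        # one attribute lookup, done once per call instead of per comparison
        colfmt = col.format
        s = line[col.start:col.end+1]
        # put the most frequent format first (guessing UNSIGNED here)
        if colfmt == ColumnFormat.UNSIGNED:
            return Format.parseUnsigned(s)
        if colfmt == ColumnFormat.DATE:
            return Format.parseDate(s)
        # ... remaining formats as in your original version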

    Cheers,
    John
     
    John Machin, Dec 13, 2007
    #5
  6. On Wed, 12 Dec 2007 14:48:03 -0800, igor.tatarinov wrote:

    > Hi, I am pretty new to Python and trying to use it for a relatively
    > simple problem of loading a 5 million line text file and converting it
    > into a few binary files. The text file has a fixed format (like a
    > punchcard). The columns contain integer, real, and date values. The
    > output files are the same values in binary. I have to parse the values
    > and write the binary tuples out into the correct file based on a given
    > column. It's a little more involved but that's not important.


    I suspect that this actually is important, and that your slowdown has
    everything to do with the stuff you dismiss and nothing to do with
    Python's object model or execution speed.


    > I have a C++ prototype of the parsing code and it loads a 5 Mline file
    > in about a minute. I was expecting the Python version to be 3-4 times
    > slower and I can live with that. Unfortunately, it's 20 times slower and
    > I don't see how I can fix that.


    I've run a quick test on my machine with a mere 1GB of RAM, reading the
    entire file into memory at once, and then doing some quick processing on
    each line:


    >>> def make_big_file(name, size=5000000):
    ...     fp = open(name, 'w')
    ...     for i in xrange(size):
    ...         fp.write('here is a bunch of text with a newline\n')
    ...     fp.close()
    ...
    >>> make_big_file('BIG')
    >>>
    >>> def test(name):
    ...     import time
    ...     start = time.time()
    ...     fp = open(name, 'r')
    ...     for line in fp.readlines():
    ...         line = line.strip()
    ...         words = line.split()
    ...     fp.close()
    ...     return time.time() - start
    ...
    >>> test('BIG')
    22.53150200843811

    Twenty two seconds to read five million lines and split them into words.
    I suggest the other nineteen minutes and forty-odd seconds your code is
    taking has something to do with your code and not Python's execution
    speed.

    Of course, I wouldn't normally read all 5M lines into memory in one big
    chunk. Replace the code

    for line in fp.readlines():

    with

    for line in fp:

    and the time drops from 22 seconds to 16.



    --
    Steven
     
    Steven D'Aprano, Dec 13, 2007
    #6
  7. DouhetSukd Guest

    Back about 8 yrs ago, on pc hardware, I was reading twin 5 Mb files
    and doing a 'fancy' diff between the 2, in about 60 seconds. Granted,
    your file is likely bigger, but so is modern hardware and 20 mins does
    seem a bit high.

    Can't talk about the rest of your code, but some parts of it may be
    optimized:

    def parseValue(line, col):
        s = line[col.start:col.end+1]
        # no switch in python
        if col.format == ColumnFormat.DATE:
            return Format.parseDate(s)
        if col.format == ColumnFormat.UNSIGNED:
            return Format.parseUnsigned(s)

    How about taking the big if clause out? That would require making all
    the formatters into functions, rather than in-lining some of them, but
    it may clean things up.

    # prebuilding a lookup of functions vs. expected formats...
    # This is done once.
    # Remember, you have to position this dict's computation _after_ all
    # the Format.parseXXX declarations. Don't worry, Python _will_ complain
    # if you don't.

    dict_format_func = {ColumnFormat.DATE: Format.parseDate,
                        ColumnFormat.UNSIGNED: Format.parseUnsigned,
                        ....

    def parseValue(line, col):
        s = line[col.start:col.end+1]

        # get applicable function, apply it to s
        return dict_format_func[col.format](s)

    Also...

    if col.format == ColumnFormat.STRING:
        # and-or trick (no x ? y:z in python 2.4)
        return not col.strip and s or rstrip(s)

    Watch out! 'col.strip' here is not the result of stripping the
    column, it is the strip _function_ itself, bound to the col object, so
    it will always be true. I get caught by those things all the time :-(

    I agree that taking out the dot.dot.dots would help, but I wouldn't
    expect it to matter that much, unless it was in an incredibly tight
    loop.

    It might be that.

    if s.startswith('999999') or s.startswith('000000'): return -1

    would be better as...

    # outside of the loop, define a set of values for which you want to
    # return -1
    set_return = set(['999999', '000000'])

    # lookup first 6 chars in your set
    def parseDate(s):
        if s[0:6] in set_return:
            return -1
        return int(mktime(strptime(s, "%y%m%d")))

    Bottom line: Python built-in data objects, such as dictionaries and
    sets, are very much optimized. Relying on them, rather than writing a
    lot of ifs and doing weird data structure manipulations in Python
    itself, is a good approach to try. Try to build those objects outside
    of your main processing loops.

    Cheers

    Douhet-did-suck
     
    DouhetSukd, Dec 13, 2007
    #7
  8. On Wed, 12 Dec 2007 16:44:01 -0800, igor.tatarinov wrote:

    > Here is some of my code. Tell me what's wrong with it :)
    >
    > def loadFile(inputFile, loader):
    > # .zip files don't work with zlib


    Pardon?

    > f = popen('zcat ' + inputFile)
    > for line in f:
    >     loader.handleLine(line)


    Do you really need to compress the file? Five million lines isn't a lot.
    It depends on the length of each line, naturally, but I'd be surprised if
    it were more than 100MB.

    > ...
    >
    > In Loader class:
    > def handleLine(self, line):
    >     # filter out 'wrong' lines
    >     if not self._dataFormat(line): return



    Who knows what the _dataFormat() method does? How complicated is it? Why
    is it a private method?


    > # add a new output record
    > rec = self.result.addRecord()


    Who knows what this does? How complicated it is?


    > for col in self._dataFormat.colFormats:


    Hmmm... a moment ago, _dataFormat seemed to be a method, or at least a
    callable. Now it has grown a colFormats attribute. Complicated and
    confusing.


    > value = parseValue(line, col)
    > rec[col.attr] = value
    >
    > And here is parseValue (will using a hash-based dispatch make it much
    > faster?):


    Possibly, but not enough to reduce 20 minutes to one or two.

    But you know something? Your code looks like a bad case of
    over-generalisation. I assume it's a translation of your C++ code -- no
    wonder it takes an entire minute to process the file! (Oh lord, did I
    just say that???) Object-oriented programming is a useful tool, but
    sometimes you don't need a HyperDispatcherLoaderManagerCreator, you
    just need a hammer.

    In your earlier post, you gave the data specification:

    "The text file has a fixed format (like a punchcard). The columns contain
    integer, real, and date values. The output files are the same values in
    binary."

    Easy-peasy. First, some test data:


    fp = open('BIG', 'w')
    for i in xrange(5000000):
        anInt = i % 3000
        aBool = ['TRUE', 'YES', '1', 'Y', 'ON',
                 'FALSE', 'NO', '0', 'N', 'OFF'][i % 10]
        aFloat = ['1.12', '-3.14', '0.0', '7.42'][i % 4]
        fp.write('%s %s %s\n' % (anInt, aBool, aFloat))
        if i % 45000 == 0:
            # Write a comment and a blank line.
            fp.write('# this is a comment\n \n')

    fp.close()



    Now let's process it:


    import struct

    # Define converters for each type of value to binary.
    def fromBool(s):
        """String to boolean byte."""
        s = s.upper()
        if s in ('TRUE', 'YES', '1', 'Y', 'ON'):
            return struct.pack('b', True)
        elif s in ('FALSE', 'NO', '0', 'N', 'OFF'):
            return struct.pack('b', False)
        else:
            raise ValueError('not a valid boolean')

    def fromInt(s):
        """String to integer bytes."""
        return struct.pack('l', int(s))

    def fromFloat(s):
        """String to floating point bytes."""
        return struct.pack('f', float(s))


    # Assume three fields...
    DEFAULT_FORMAT = [fromInt, fromBool, fromFloat]

    # And three files...
    OUTPUT_FILES = ['ints.out', 'bools.out', 'floats.out']


    def process_line(s, format=DEFAULT_FORMAT):
        s = s.strip()
        fields = s.split()  # I assume the fields are whitespace separated
        assert len(fields) == len(format)
        return [f(x) for (x, f) in zip(fields, format)]

    def process_file(infile, outfiles=OUTPUT_FILES):
        out = [open(f, 'wb') for f in outfiles]
        for line in file(infile, 'r'):
            # ignore leading/trailing whitespace and comments
            line = line.strip()
            if line and not line.startswith('#'):
                fields = process_line(line)
                # now write the fields to the files
                for x, fp in zip(fields, out):
                    fp.write(x)
        for f in out:
            f.close()



    And now let's use it and see how long it takes:

    >>> import time
    >>> s = time.time(); process_file('BIG'); time.time() - s
    129.58465385437012


    Naturally if your converters are more complex (e.g. date-time), or if you
    have more fields, it will take longer to process, but then I've made no
    effort at all to optimize the code.
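
    (If you want a date converter in the same style, here is a rough,
    untested sketch using the "%y%m%d" layout from the original post; it
    relies on the struct import above:)

    from time import mktime, strptime

    def fromDate(s):
        """Fixed-width %y%m%d string to packed timestamp."""
        if s[:6] in ('999999', '000000'):
            # sentinel for missing/infinite dates
            return struct.pack('l', -1)
        return struct.pack('l', int(mktime(strptime(s, "%y%m%d"))))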



    --
    Steven.
     
    Steven D'Aprano, Dec 13, 2007
    #8
  9. Guest

    igor:
    > The fundamental difference is that in C++, I create a single object (a
    > line buffer) that's reused for each input line and column values are
    > extracted straight from that buffer without creating new string
    > objects. In python, new objects must be created and destroyed by the
    > million which must incur serious memory management overhead.


    Python does indeed create many objects (as I think Tim once said, it
    "allocates memory at a ferocious rate"), but its memory management is
    quite efficient. You may also try the Psyco JIT (currently a thousand
    times more useful than PyPy, despite sadly not being developed any
    more), which in some situations avoids copying data (for example with
    slices). Python is well suited to string processing, and in my
    experience string-processing programs run under Psyco can be faster
    than similar not-optimized-to-death C++/D programs (you can see this
    with hand-crafted code, or with ShedSkin output, which is often slower
    than Psyco for string processing). But in every language I know, to
    gain performance you need to know the language; Python isn't C++, so
    other kinds of tricks are necessary.
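
    A minimal way to hook Psyco in (just a sketch; it assumes Psyco is
    installed and that parseValue is one of the hot functions from the
    earlier post):

    try:
        import psyco
        psyco.full()  # or psyco.bind(parseValue) to target just the hot spots
    except ImportError:
        pass          # Psyco not available: fall back to plain CPython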

    The following advice is useful too:

    DouhetSukd:
    >Bottom line: Python built-in data objects, such as dictionaries and
    sets, are very much optimized. Relying on them, rather than writing a
    lot of ifs and doing weird data structure manipulations in Python
    itself, is a good approach to try. Try to build those objects outside
    of your main processing loops.<

    Bye,
    bearophile
     
    , Dec 13, 2007
    #9
  10. Neil Cerutti Guest

    On 2007-12-13, <> wrote:
    > On Dec 12, 4:03 pm, John Machin <> wrote:
    >> Inside your function
    >> [you are doing all this inside a function, not at global level in a
    >> script, aren't you?], do this:
    >> from time import mktime, strptime # do this ONCE
    >> ...
    >> blahblah = int(mktime(strptime(s, "%m%d%y%H%M%S")))
    >>
    >> It would help if you told us what platform, what version of Python,
    >> how much memory, how much swap space, ...
    >>
    >> Cheers,
    >> John

    >
    > I am using a global 'from time import ...'. I will try to do that
    > within the
    > function and see if it makes a difference.
    >
    > The computer I am using has 8G of RAM. It's a Linux dual-core AMD or
    > something like that. Python 2.4
    >
    > Here is some of my code. Tell me what's wrong with it :)
    >
    > def loadFile(inputFile, loader):
    >     # .zip files don't work with zlib
    >     f = popen('zcat ' + inputFile)
    >     for line in f:
    >         loader.handleLine(line)
    >     ...
    >
    > In Loader class:
    > def handleLine(self, line):
    >     # filter out 'wrong' lines
    >     if not self._dataFormat(line): return
    >
    >     # add a new output record
    >     rec = self.result.addRecord()
    >
    >     for col in self._dataFormat.colFormats:
    >         value = parseValue(line, col)
    >         rec[col.attr] = value
    >
    > def parseValue(line, col):
    >     s = line[col.start:col.end+1]
    >     # no switch in python
    >     if col.format == ColumnFormat.DATE:
    >         return Format.parseDate(s)
    >     if col.format == ColumnFormat.UNSIGNED:
    >         return Format.parseUnsigned(s)
    >     if col.format == ColumnFormat.STRING:
    >         # and-or trick (no x ? y:z in python 2.4)
    >         return not col.strip and s or rstrip(s)
    >     if col.format == ColumnFormat.BOOLEAN:
    >         return s == col.arg and 'Y' or 'N'
    >     if col.format == ColumnFormat.PRICE:
    >         return Format.parseUnsigned(s)/100.
    >
    > And here is Format.parseDate() as an example:
    > def parseDate(s):
    >     # missing (infinite) value ?
    >     if s.startswith('999999') or s.startswith('000000'): return -1
    >     return int(mktime(strptime(s, "%y%m%d")))


    An inefficient parsing technique is probably to blame. You first
    inspect the line to make sure it is valid, then you inspect it once
    per column to discover what data type it contains, and then you
    inspect it *again* to finally translate it.

    > And here is parseValue (will using a hash-based dispatch make
    > it much faster?):


    Not much.

    You should be able to validate, recognize and translate all in
    one pass. Get pyparsing to help, if need be.
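
    Very rough sketch of the one-pass idea (untested; the field layout and
    converters here are invented, not your real format):

    from time import mktime, strptime

    def to_date(s):
        return int(mktime(strptime(s, "%y%m%d")))

    # (slice, converter) pairs, built once per file format, outside the
    # per-line loop; slicing plus converting is the whole pass
    FIELDS = [(slice(0, 6), to_date),
              (slice(6, 14), int),
              (slice(14, 20), lambda s: int(s) / 100.0)]

    def parse_line(line):
        # one pass: bad data raises ValueError, which the caller can catch
        return [conv(line[sl]) for sl, conv in FIELDS]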

    What does your data look like?

    --
    Neil Cerutti
     
    Neil Cerutti, Dec 13, 2007
    #10
    Re: [Python] Re: efficient data loading with Python, is that possible?

    Neil Cerutti wrote:
    > An inefficient parsing technique is probably to blame. You first
    > inspect the line to make sure it is valid, then you inspect it
    > (number of column type) times to discover what data type it
    > contains, and then you inspect it *again* to finally translate
    > it.
    >

    I was thinking just that. It is much more "pythonic" to simply attempt
    to convert the values in whatever fashion they are supposed to be
    converted, and handle errors in data format by means of exceptions.
    IMO, of course. In the "trivial" case, where there are no errors in the
    data file, this is a heck of a lot faster.
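
    Something like this, roughly (the field positions are invented for the
    example):

    def handle_lines(f):
        for line in f:
            try:
                rec = (int(line[0:6]), float(line[6:14]))
            except ValueError:
                continue  # malformed line: skip it instead of pre-validating
            # ... write rec to the appropriate output file here ...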

    -- Chris.
     
    Chris Gonnerman, Dec 14, 2007
    #11
