Newbie completely confused

Discussion in 'Python' started by Jeroen Hegeman, Sep 21, 2007.

  1. Dear Pythoneers,

    I'm moderately new to python and it got me completely lost already.

    I've got a bunch of large (30MB) txt files containing one 'event' per
    line. I open the files one after another, read them line by line and from
    each line build a 'data structure' of a main class (HugeClass)
    containing some simple information as well as several instances of
    some other classes.

    No problem so far, but I noticed that the first file was always
    faster than the others, whereas I would expect it to be slower, if
    anything. Testing with two copies of the same file shows the same
    behaviour.

    Below is a (rather large, I'll explain) chunk of code. I ran this in
    a directory with two test files called 'test_file0.txt' and
    'test_file1.txt', each containing 10k lines of the same information
    as the 'long_line' variable in the code. This shows the following
    timing (consistently) for the little piece of code that reads all
    lines from file:

    ....processing all 2 files found
    --> 1/2: ./test_file0.txt
    Now reading ...
    DEBUG readLines A took 0.093 s
    ....took 8.85717201233 seconds
    --> 2/2: ./test_file0.txt
    Now reading ...
    DEBUG readLines A took 3.917 s
    ....took 12.8725550175 seconds

    So the first time around the file gets read in in ~0.1 seconds, the
    second time around it needs almost four seconds! As far as I can see
    this is related to 'something in memory being copied around' since if
    I replace the 'alternative 1' by the 'alternative 2', basically
    making sure that my classes are not used, reading time the second
    time around drops back to normal (= roughly what it is the first pass).

    I already want to apologise for the size of the code chunk below. I
    know about 'minimal reproducible examples' and such but I found out
    that if I commented out the filling (and thus binding) of some of the
    member variables in the lower-level classes, the problem (sometimes)
    also disappears. That also points to some magic happening in memory?

    I probably mucked something up but I'm really lost as to where. Any
    help would be appreciated.

    The original problem showed up using Python 2.4.3 under linux (Fedora
    Core 1).
    Python 2.3.5 on OS X 10.4.10 (PPC) appears not to show this issue(?).

    Thanks,
    Jeroen

    P.S. Any ideas on optimising the input to the classes would be
    welcome too ;-)

    Jeroen Hegeman
    jeroen DOT hegeman AT gmail DOT com



    ===================Start of code chunk=========================
    #!/usr/bin/env python

    import time
    import sys
    import os
    import gzip
    import pdb

    # One sample 'event' line, split into adjacent string literals.
    long_line = (
        "1,31905,0,174501,46152419,2117961,143,-1.0000,51,2,-19.9139,42,-19.9140,"
        "6.6002,0,0,0,46713.1484,2,0.0000,-1,1.4203220606,0.3876158297,147.121017"
        "4561,147.1284120973,-2,0.0000,-1,1.5887237787,-2.4011900425,-319.7776794"
        "434,319.7906836817,4,21,0.0000,-1,-0.5672637224,2.2052443027,-43.2842369"
        "080,43.3440905719,21,0.0000,-1,-0.8540721536,0.0770076364,-22.7033920288,"
        "22.7195827425,21,0.0000,-1,0.1623233557,0.5845987201,-28.0794525146,28.0"
        "860084170,21,0.0000,-1,0.1943928897,-0.2195242196,-22.0666370392,22.0685"
        "899391,6,0.0000,-1,-40.1810989380,-127.0743789673,-104.9231948853,239.74"
        "36794163,-6,0.0000,-1,43.2013626099,125.0640945435,-67.7339172363,227.17"
        "53587387,24,0.0000,-1,-57.9123306274,-17.3483123779,-71.8334121704,123.4"
        "397648033,-24,0.0000,-1,84.0985488892,54.4542312622,-62.4525032043,144.5"
        "299239704,5,0.0000,-1,17.7312316895,-109.7260665894,-33.0897827148,116.3"
        "039146130,-5,0.0000,-1,-40.8971862793,70.6098632812,-5.2814140320,82.645"
        "4347683,4,0.0000,-1,-6.2859884724,-17.9586020410,-58.9464384913,69.40294"
        "68585,-3,0.0000,-1,-51.6263811588,0.6104701459,-12.8869901896,54.0368221"
        "571,3,0.0000,-1,16.4690684490,48.0271777511,-51.7867884636,74.5327484701"
        ",-4,0.0000,-1,67.6295298338,6.4269350171,-10.6658525467,69.9971834876,7,"
        "7,1.0345464706e+01,-7.0800781250e+01,-2.0385742187e+01,7.5256346272e"
        "+01,1.3148,0.0072,0.0072,1.3148,0.0072,0.0072,1.0255,1.0413,0.0,0.0,0.0,"
        "0.0,-1.0,-4.2383,49.5276,13,0.1537,0.5156,0,0.9982,0.0034,1.0000,7,1,0.9"
        "566,0.0062,1,0,2,1.2736,1,7.8407,1,0,2,1.2736,1,7.8407,0,0,-1.0,-1.0,5,1"
        ",-2.4047853470e+01,4.0832519531e+01,-3.8452150822e+00,4.7851562559e"
        "+01,1.3383,0.0051,0.0051,1.3383,0.0051,0.0051,0.9340,0.9541,0.0,0.0,0.0,"
        "0.0,-1.0,-2.4609,21.3916,7,0.1166,0.5977,0,0.9999,0.0052,1.0000,9,1,0.99"
        "47,0.0063,1,0,2,0.7735,1,74.7937,1,0,2,0.7735,1,74.7937,0,0,-1.0,-1.0,5,"
        "1,-4.4067382812e+01,2.5634796619e+00,-1.1138916016e+01,4.6203614579e"
        "+01,1.3533,0.0054,0.0054,1.3533,0.0054,0.0054,1.0486,1.0903,0.0,0.0,0.0,"
        "0.0,-1.0,-3.9648,31.3733,13,0.1767,0.5508,100,0.9977,0.0040,1.0000,9,1,0"
        ".0000,0.4349,0,0,0,0.0000,0,-1000.0000,0,0,0,0.0000,0,-1000.0000,0,0,-1.0"
        ",-1.0,0,1,3.7200927734e+01,2.7465817928e+00,-5.5847163200e+00,3.7994386563e"
        "+01,1.3634,0.0062,0.0062,1.6488,0.0385,0.0385,0.7141,0.9013,5.3986899118"
        "e+00,6.6766492833e-01,-2.3780213181e-01,5.4460399892e"
        "+00,0.5504,-3.1445,0.7776,9,0.1169,0.7734,0,0.9977,0.0040,1.0000,7,1,0.0"
        "000,0.1099,0,0,0,0.0000,0,-1000.0000,0,0,0,0.0000,0,-1000.0000,1,-1,5.38"
        "93,0.5459,4,1,1.2969970703e+01,3.3203125000e+01,-3.7231445312e+01,5.2001951876e"
        "+01,1.4414,0.0129,0.0129,1.4414,0.0129,0.0129,0.9019,0.7331,0.0,0.0,0.0,"
        "0.0,-1.0,-10.0195,12.2034,17,0.1922,0.3633,0,0.9774,0.0248,1.0000,6,1,0."
        "0000,0.3523,0,0,0,0.0000,0,-1000.0000,0,0,0,0.0000,0,-1000.0000,0,0,-1.0"
        ",-1.0,0,1,-1.6174327135e+00,-7.1411132812e+00,-1.8798828125e+01,2.0202637222e"
        "+01,1.7886,0.0352,0.0352,1.7886,0.0352,0.0352,1.8257,1.2368,0.0,0.0,0.0,"
        "0.0,-1.0,-17.3438,45.6714,10,0.1529,0.5625,0,0.9898,0.0094,1.0000,3,1,-1"
        ".0000,10000.0000,0,0,0,-1.0000,0,-1.0000,0,0,0,-1.0000,0,-1.0000,0,0,-1.0"
        ",-1.0,-6,0,-5.9204106331e+00,-3.4484868050e+00,-6.5307617187e+00,9.6740722971e"
        "+00,1.6782,0.0326,0.0326,1.6782,0.0326,0.0326,1.0000,1.0000,0.0,0.0,0.0,"
        "0.0,-1.0,-9.4727,37.3401,13,0.2711,0.2344,100,0.9861,0.0045,1.0000,3,1,-"
        "1.0000,10000.0000,0,0,0,-1.0000,0,-1.0000,0,0,0,-1.0000,0,-1.0000,0,0,-1"
        ".0,-1.0,-6,0"
    )

    ########################################################################

    class SmallClass:
        def __init__(self):
            return
        def input(self, line, c):
            self.item0 = int(line[c]); c += 1
            self.item1 = float(line[c]); c += 1
            self.item2 = int(line[c]); c += 1
            self.item3 = float(line[c]); c += 1
            self.item4 = float(line[c]); c += 1
            self.item5 = float(line[c]); c += 1
            self.item6 = float(line[c]); c += 1
            return c

    ########################################################################

    class ModerateClass:
        def __init__(self):
            return
        def __del__(self):
            pass
            return
        def input(self, line, c):

            self.items = {}

            self.item0 = float(line[c]);
            c += 1

            unit1 = SmallClass()
            c = unit1.input(line, c)
            self.items[len(self.items)] = unit1
            unit2 = SmallClass()
            c = unit2.input(line, c)
            self.items[len(self.items)] = unit2

            units_chunk = []
            chunk_size = int(line[c])
            c += 1
            for i in xrange(chunk_size):
                unit = SmallClass()
                c = unit.input(line, c)
                units_chunk.append(unit)
            for i in xrange(10):
                unit = SmallClass()
                c = unit.input(line, c)
            return c

    ########################################################################

    class LongClass:

        def __init__(self):
            return
        def clear(self):
            return
        def input(self, foo, c):
            self.item0 = float(foo[c]); c += 1
            self.item1 = float(foo[c]); c += 1
            self.item2 = float(foo[c]); c += 1
            self.item3 = float(foo[c]); c += 1
            self.item4 = float(foo[c]); c += 1
            self.item5 = float(foo[c]); c += 1
            self.item6 = float(foo[c]); c += 1
            self.item7 = float(foo[c]); c += 1
            self.item8 = float(foo[c]); c += 1
            self.item9 = float(foo[c]); c += 1
            self.item10 = float(foo[c]); c += 1
            self.item11 = float(foo[c]); c += 1
            self.item12 = float(foo[c]); c += 1
            self.item13 = float(foo[c]); c += 1
            self.item14 = float(foo[c]); c += 1
            self.item15 = float(foo[c]); c += 1
            self.item16 = float(foo[c]); c += 1
            self.item17 = float(foo[c]); c += 1
            self.item18 = float(foo[c]); c += 1
            self.item19 = int(foo[c]); c += 1
            self.item20 = float(foo[c]); c += 1
            self.item21 = float(foo[c]); c += 1
            self.item22 = int(foo[c]); c += 1
            self.item23 = float(foo[c]); c += 1
            self.item24 = float(foo[c]); c += 1
            self.item25 = float(foo[c]); c += 1
            self.item26 = int(foo[c]); c += 1
            self.item27 = bool(int(foo[c])); c += 1
            self.item28 = float(foo[c]); c += 1
            self.item29 = float(foo[c]); c += 1
            self.item30 = (foo[c] == "1"); c += 1
            self.item31 = (foo[c] == "1"); c += 1
            self.item32 = float(foo[c]); c += 1
            self.item33 = float(foo[c]); c += 1
            self.item34 = int(foo[c]); c += 1
            self.item35 = float(foo[c]); c += 1
            self.item36 = (foo[c] == "1"); c += 1
            self.item37 = (foo[c] == "1"); c += 1
            self.item38 = float(foo[c]); c += 1
            self.item39 = float(foo[c]); c += 1
            self.item40 = int(foo[c]); c += 1
            self.item41 = float(foo[c]); c += 1
            self.item42 = (foo[c] == "1"); c += 1
            self.item43 = float(foo[c]); c += 1
            self.item44 = float(foo[c]); c += 1
            self.item45 = float(foo[c]); c += 1
            self.item46 = int(foo[c]); c += 1
            self.item47 = bool(int(foo[c])); c += 1
            return c

    ########################################################################

    class HugeClass:
        def __init__(self,line):
            self.clear()
            self.input(line)
            return
        def __del__(self):
            del self.B4v
            return
        def clear(self):
            self.long_classes = {}
            self.B4v={}
            return
        def input(self, line):

            try:
                foo = line.strip().split(',')
                c = 0

                self.asciiVersion = float(foo[c])
                c += 1

                self.item0 = foo[c]; c += 1
                self.item1 = (self.item0 != "0")

                self.item2 = (foo[c] == "1"); c += 1

                self.item3 = int(foo[c]); c += 1
                self.item4 = int(foo[c]); c += 1
                self.item5 = int(foo[c]); c += 1
                self.item6 = int(foo[c]); c += 1
                self.item7 = float(foo[c]); c += 1

                self.item8 = foo[c]; c += 1
                bit_item = int(self.item8)
                self.item9 = bool(bit_item & 2048)
                self.item10 = bool(bit_item & 1024)
                self.item11 = bool(bit_item & 512)
                self.item12 = bool(bit_item & 256)
                self.item13 = bool(bit_item & 128)
                self.item14 = bool(bit_item & 64)
                self.item15 = bool(bit_item & 32)
                self.item16 = bool(bit_item & 16)
                self.item17 = bool(bit_item & 8)
                self.item18 = bool(bit_item & 4)
                self.item19 = bool(bit_item & 2)
                self.item20 = bool(bit_item & 1)

                self.item21 = int(foo[c]); c += 1
                self.item22 = float(foo[c]); c += 1
                self.item23 = int(foo[c]); c += 1
                self.item24 = float(foo[c]); c += 1

                self.item25 = float(foo[c]); c += 1

                self.item26 = foo[c]; c += 1
                self.item27 = int(foo[c]); c += 1
                self.item28 = int(foo[c]); c += 1

                self.item29 = ModerateClass()
                c = self.item29.input(foo, c)

                self.item30 = int(foo[c]); c += 1
                self.item31 = int(foo[c]); c += 1

                for i in xrange(self.item31):
                    unit = LongClass()
                    c = unit.input(foo, c)
                    self.long_classes[len(self.long_classes)] = unit

                assert(c == len(foo)), "ERROR We did not read the whole line!!!"

            except (ValueError,IndexError), msg:
                print >> sys.stderr, \
                      "ERROR Trouble reading line: `%(msg)s'" % vars()
                self.clear()
                return
            return

    ########################################################################

    def readLines(f):
        DATA = []
        f.seek(0)

        time_a = time.time()
        for i in f:
            DATA.append(i)
        time_b = time.time()

        time_spent_reading = time_b - time_a
        print "DEBUG readLines took %.3f s" % time_spent_reading

        return DATA

    ########################################################################

    def ReadClasses(filename):

        print 'Now reading ...'

        built_classes = {}

        # Read lines from file
        in_file = open(filename, 'r')
        LINES = readLines(in_file)
        in_file.close()

        # and interpret them.
        for i in LINES:
            ## This is alternative 1.
            built_classes[len(built_classes)] = HugeClass(long_line)
            ## The next line is alternative 2.
            ## built_classes[len(built_classes)] = long_line

        del LINES

        return

    ########################################################################

    def ProcessList():

        input_files = ["./test_file0.txt",
                       "./test_file0.txt"]

        # Loop over all files that we found.
        nfiles = len(input_files)
        file_index = 0
        for i in input_files:
            print "--> %i/%i: %s" % (file_index+1, nfiles, i)
            ReadClasses(i)
            file_index += 1

        return

    ########################################################################

    if __name__ == "__main__":
        ProcessList()

        sys.exit(0)

    ########################################################################
     
    Jeroen Hegeman, Sep 21, 2007
    #1

  2. Jeroen Hegeman

    John Machin Guest

    Your code does NOT include any statements that could have produced the
    above line of output -- IOW, you have not posted the code that you
    actually ran. Your code is already needlessly monstrously large.
    That's two strikes against anyone bothering to try to nut out what's
    going wrong, if indeed anything is going wrong.

    [snip]
    And Python 2.5.1 does what? Strike 3.

    1. What is the point of having a do-nothing __init__ method? I'd
    suggest making the __init__ method do the "input".

    2. See below

    [snip]
    at global level:

    converters = [float] * 48
    cvlist = [
        (int, (19, 22, 26, 34, 40, 46)),
        (lambda z: bool(int(z)), (27, 47)),
        (lambda z: z == "1", (30, 31, 36, 37, 42)),
        ]
    for func, indexes in cvlist:
        for x in indexes:
            converters[x] = func
    enumerated_converters = list(enumerate(converters))

    Then:

    def input(self, foo, c):
        self.item = [func(foo[c+x]) for x, func in enumerated_converters]
        return c + 48

    which requires you refer to obj.item[19] instead of obj.item19

    If you *must* use item19 etc, then try this:

    for x, func in enumerated_converters:
        setattr(self, "item%d" % x, func(foo[c+x]))

    You could also (shock, horror) use meaningful names for the
    attributes ... include a list of attribute names in the global stuff,
    and put the relevant name in as the 2nd arg of setattr() instead of
    itemxx.
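
    A minimal sketch of the named-attribute variant, written for current
    Python 3. The field names and the Record class are hypothetical
    placeholders (the real attribute names are not given in this thread);
    only the setattr() pattern is the point:

```python
# Hypothetical (name, converter) pairs; the real names are not in the thread.
FIELD_SPECS = [
    ("energy", float),
    ("n_hits", int),
    ("is_valid", lambda z: bool(int(z))),
    ("is_primary", lambda z: z == "1"),
]


class Record(object):
    def input(self, fields, c):
        """Convert len(FIELD_SPECS) fields starting at index c.

        Returns the index of the first field that was not consumed.
        """
        for offset, (name, func) in enumerate(FIELD_SPECS):
            setattr(self, name, func(fields[c + offset]))
        return c + len(FIELD_SPECS)


rec = Record()
next_index = rec.input("9.5,13,1,0".split(","), 0)
print(rec.energy, rec.n_hits, rec.is_valid, rec.is_primary, next_index)
```

    Keeping names and converters in one table means a new field is one
    extra line, and the returned index stays correct automatically.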

    For handling the bit extraction stuff, either

    (a) conversion functions have a 2nd arg which defaults to None and
    whose usage depends on the function itself ... would be mask or bit
    position (or could be e.g. a scale factor for implied-decimal-point
    input)

    or

    (b) do a loop over the bit positions
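
    A minimal sketch of option (b), written for current Python 3. The
    Flags class is a hypothetical stand-in for the bit-extraction part of
    HugeClass.input(); it assumes, as in the posted code, that item9 holds
    the 2048 bit and item20 the 1 bit:

```python
class Flags(object):
    """Hypothetical stand-in for the item9..item20 block of HugeClass."""

    def __init__(self, bit_item):
        # item9 <-> bit position 11 (mask 2048), ..., item20 <-> position 0.
        for k in range(12):
            mask = 1 << (11 - k)
            setattr(self, "item%d" % (9 + k), bool(bit_item & mask))


flags = Flags(51)  # 51 == 32 + 16 + 2 + 1
print(flags.item15, flags.item16, flags.item17, flags.item19, flags.item20)
```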

    HTH,
    John
     
    John Machin, Sep 22, 2007
    #2

  3. Jeroen Hegeman wrote:
    (First, I had to add timing code to ReadClasses: the code you posted
    doesn't include it, and only shows timings for readLines.)

    Your program uses quite a bit of memory. I guess it gets harder and
    harder to allocate the required amounts of memory.

    If I change this line in ReadClasses:

    built_classes[len(built_classes)] = HugeClass(long_line)

    to

    dummy = HugeClass(long_line)

    then both times the files are read and your data structures are built,
    but after each run the data structure is freed. The result is that both
    runs are equally fast.

    Also, if I run the first version (without the dummy) on a computer with
    a bit more memory (1 GiB), it seems there is no problem allocating
    memory: both runs are equally fast.
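
    The keep-versus-discard experiment described above can be sketched
    roughly as follows, for current Python 3. Event is a hypothetical,
    much smaller stand-in for HugeClass, and build() is an illustrative
    helper, not code from the original program. Timings depend on the
    machine and Python version, so the sketch makes no claim to reproduce
    the slowdown:

```python
import time


class Event(object):
    """Tiny hypothetical stand-in for HugeClass."""

    def __init__(self, line):
        self.values = [float(x) for x in line.split(",")]


def build(lines, keep):
    """Build one Event per line; return (seconds spent, kept objects)."""
    kept = []
    start = time.time()
    for line in lines:
        event = Event(line)
        if keep:
            kept.append(event)  # object stays alive after the loop
        # otherwise 'event' is rebound, and freed, on the next iteration
    return time.time() - start, kept


lines = ["1.0,2.0,3.0"] * 10000
for keep in (True, False):
    seconds, kept = build(lines, keep)
    print("keep=%s: %.3f s, %d objects alive" % (keep, seconds, len(kept)))
```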

    I'm not sure how to speed things up here... you're doing much processing
    on a lot of small chunks of data. I have a number of observations and
    possible improvements though, and some might even speed things up a bit.

    You read the files, but don't use the contents; instead you use
    long_line over and over. I suppose you do that because this is a test,
    not your actual code?

    __init__() with nothing (or only return) in it is not useful; better to
    just leave it out.


    You have a number of return statements that don't do anything (i.e. they
    return nothing (None actually) at the end of the function). A function
    without return automatically returns None at the end, so it's better to
    leave them out.

    Similarly you don't need to call sys.exit(): the script will terminate
    anyway if it reaches the end. Better leave it out.


    LongClass.clear() doesn't do anything and isn't called anyway; leave it out.


    ModerateClass.__del__() doesn't do anything either. I'm not sure how it
    affects what happens if ModerateClass gets freed, but I suggest you
    don't start messing with __del__() until you have more Python knowledge
    and experience. I'm not sure why you think you need to implement that
    method.
    The same goes for HugeClass.__del__(). It does delete self.B4v, but the
    default behavior will do that too. Again, I don't get why you want to
    override the default behavior.


    In a number of cases, you use a dict like this:

    built_classes = {}
    for i in LINES:
        built_classes[len(built_classes)] = ...

    So you're using the indices 0, 1, 2, ... as the keys. That's not what
    dictionaries are made for; lists are much better for that:

    built_classes = []
    for i in LINES:
        built_classes.append(...)


    HugeClass.B4v isn't used, so you can safely remove it.


    Your readLines() function reads a whole file into memory. If you're
    working with large files, that's not such a good idea. It's better to
    load one line at a time into memory and work on that. I would even
    completely remove readLines() and restructure ReadClasses() like this:

    def ReadClasses(filename):
        print 'Now reading ...'

        built_classes = []

        # Open file
        in_file = open(filename, 'r')

        # Read lines and interpret them.
        time_a = time.time()
        for i in in_file:
            ## This is alternative 1.
            built_classes.append(HugeClass(long_line))
            ## The next line is alternative 2.
            ## built_classes.append(long_line)

        in_file.close()
        time_b = time.time()
        print "DEBUG readClasses took %.3f s" % (time_b - time_a)

    Personally I only use 'i' for integer indices (as in 'for i in
    range(10)'); for other use I prefer more descriptive names:

    for line in in_file: ...

    But I guess that's up to personal preference. Also you used LINES to
    store the file contents; the convention is that names with all capitals
    are used for constants, not for things that change.


    In ProcessList(), you keep the index in a separate variable. Python has
    a trick so you don't have to do that yourself:

    nfiles = len(input_files)
    for file_index, i in enumerate(input_files):
        print "--> %i/%i: %s" % (file_index + 1, nfiles, i)
        ReadClasses(i)


    Instead of item0, item1, ... , it's generally better to use a list, so
    you can use item[0], item[1], ...


    And finally, John Machin's suggestion looks like a good way to
    restructure that long sequence of conversions and assignments in HugeClass.


    --
    The saddest aspect of life right now is that science gathers knowledge
    faster than society gathers wisdom.
    -- Isaac Asimov

    Roel Schroeven
     
    Roel Schroeven, Sep 22, 2007
    #3
  4. Thanks for the comments,

    Well, I guess there could be something in that, but why is there a
    significant increase after the first time? And after that,
    single-trip time pretty much flattens out. No more obvious increases.

    Isn't the 'del LINES' supposed to achieve the same thing? And
    really, reading 30MB files should not be such a problem, right? (I'm
    also running with 1GB of RAM.)

    Cool thanks, let's go over them.

    Yeah ;-) (Do I notice a lack of trust in the responses I get? Should
    I not mention 'newbie'?)

    Let's get a couple of things out of the way:
    - I do know about meaningful variable names and case-conventions,
    but ... First of all I also have to live with inherited code (I don't
    like people shouting in their code either), and secondly (all the
    itemx) most of these members normally _have_ descriptive names but
    I'm not supposed to copy-paste the original code to any newsgroups.
    - I also know that a plain 'return' in python does not do anything
    but I happen to like them. Same holds for the sys.exit() call.
    - The __init__ methods normally actually do something: they
    initialise some member variables to meaningful values (by calling the
    clear() method, actually).
    - The clear() method normally brings objects back into a
    well-defined 'empty' state.
    - The __del__ methods are actually needed in this case (well, in the
    _real_ code anyway). The python code loads a module written in C++
    and some of the member variables actually point to C++ objects
    created dynamically, so one actually has to call their destructors
    before unbinding the python var.

    I tried to get things down to as small as possible, but when I found
    out that the size of the classes seems to contribute to the issue
    (removing enough member variables brings you to a point where all
    of a sudden the speed increases by a factor of ten; there seems to be
    some breakpoint depending on the size of the classes) I could not
    simply remove all members but had to give them funky names. I kept
    the main structure of things, though, to see if that would solicit
    comments. (And it did...)

    Yeah, I inherited that part...

    Actually, part of what I removed was the real reason why readLines()
    is there at all: it reads files in blocks of (at most) some_number
    lines, and keeps track of the line offset in the file. I kept this
    structure hoping that someone would point out something obvious like
    some internal buffer going out of scope or whatever.
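
    A minimal sketch of that kind of block-wise reading, for current
    Python 3. The real readLines() was not posted, so read_blocks() and
    its parameters are hypothetical; the sketch assumes "line offset"
    means the index of the first line in each block:

```python
import io
from itertools import islice


def read_blocks(f, block_size):
    """Yield (line_offset, lines) with at most block_size lines per block."""
    offset = 0
    while True:
        # islice consumes at most block_size lines from the file iterator
        block = list(islice(f, block_size))
        if not block:
            break
        yield offset, block
        offset += len(block)


demo = io.StringIO(u"a\nb\nc\nd\ne\n")  # stands in for an open text file
blocks = list(read_blocks(demo, 2))
for offset, lines in blocks:
    print(offset, lines)
```

    Only one block of lines is held in memory at a time, provided the
    caller does not keep references to earlier blocks.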

    All right, thanks for the tips. I guess the issue itself is still
    open, though.

    Cheers,
    Jeroen


    WARNING: This message may contain classified information. Immediately
    burn this message after reading.
     
    Jeroen Hegeman, Sep 24, 2007
    #4
  5. Oh my, I must have cleaned it up a bit too much, hoping that people
    would focus on the issue instead of the formatting of the output
    strings! Did you miss your morning coffee???

    Which I realised and apologised for beforehand.

    Hmm, I must have missed where it said that you can only ask for help
    if you're using the latest version... In case you're wondering, 2.5.1
    is not _really_ as widespread as most of the older versions.

    Now that sounds more useful. I'll give that a try.

    Thanks,
    Jeroen

     
    Jeroen Hegeman, Sep 24, 2007
    #5
  6. Two comments,
    this here (and your code in general) is mind boggling and not in a
    good way,

    As for your original question, I don't think that reading in files of
    the size you mention can cause any substantial problems; I think the
    problem is somewhere else.

    You can run the code below to see that the read times are unaffected
    by the order of processing:

    ----------

    import timeit

    # make a big file
    NUM= 10**5
    fp = open('bigfile.txt', 'wt')
    longline = ' ABC '* 60 + '\n'
    for count in xrange( NUM ):
        fp.write( longline )
    fp.close()

    setup1 = """
    def readLines():
        data = []
        for line in file('bigfile.txt'):
            data.append( line )
        return data
    """

    stmt1 = """
    data = readLines()
    """

    stmt2 = """
    data = readLines()
    data = readLines()
    """

    stmt3 = """
    data = file('bigfile.txt').readlines()
    """

    def run( setup, stmt, N=5 ):
        t = timeit.Timer(stmt=stmt, setup=setup)
        msec = 1000 * t.timeit(number=N)/N
        print "%f msec/pass" % msec

    if __name__ == '__main__':
        for stmt in (stmt1, stmt2, stmt3):
            run(setup=setup1, stmt=stmt)
     
    Istvan Albert, Sep 24, 2007
    #6
  7. Jeroen Hegeman

    John Machin Guest

    The difference was not a formatting difference; it was complete
    absence of a statement, raising the question of what other non-obvious
    differences there might be.

    You miss the point: if it is obvious that the posted code did not
    produce the posted output (common when newbies are thrashing around
    trying to solve a problem), some of the audience may not bother trying
    to help with the main issue -- they may attempt to help with side
    issues (as I did with the fugly code bloat) or just ignore you
    altogether.

    An apology does not change the fact that the code was needlessly large
    (AND needed careful post-linefolding reformatting just to make it
    runnable) and so some may not have bothered to read it.

    You missed the point again: that your problem may be fixed in a later
    version.

    I wasn't wondering. I know. I maintain a package (xlrd) which works on
    Python 2.5 all the way back to 2.1. It occasionally has possibly
    similar "second iteration goes funny" issues (e.g. when reading 120MB
    Excel spreadsheet files one after the other). You mention that
    removing some attributes from a class may make your code stop
    exhibiting cliff-face behaviour. If you can produce two versions of
    your code that actually demonstrate the abrupt change, I'd be quite
    interested in digging into it, to our possible mutual benefit.

    I'm glad you found something possibly more useful in my posting :)

    Cheers,
    John
     
    John Machin, Sep 24, 2007
    #7
  8. Jeroen Hegeman wrote:
    Sorry, I have no idea.

    'del LINES' deletes the lines that are read from the file, but not all
    of the data structures that you created out of them.

    Now, indeed, reading 30 MB files should not be a problem. And I am
    confident that just reading the data is not a problem. To make sure I
    created a simple test:

    import time

    input_files = ["./test_file0.txt", "./test_file1.txt"]

    total_start = time.time()
    data = {}
    for input_fn in input_files:
        file_start = time.time()
        f = file(input_fn, 'r')
        data[input_fn] = f.read()
        f.close()
        file_done = time.time()
        print '%s: %f to read %d bytes' % (input_fn, file_done -
            file_start, len(data))
    total_done = time.time()
    print 'all done in %f' % (total_done - total_start)


    When I run that with test_file0.txt and test_file1.txt as you described
    (each 30 MB), I get this output:

    ./test_file0.txt: 0.260000 to read 1 bytes
    ./test_file1.txt: 0.251000 to read 2 bytes
    all done in 0.521000

    Therefore I think the problem is not in reading the data, but in
    processing it and creating the data structures.

    I didn't mean to attack you; it's just that the program reads 30 MB of
    data, twice, but doesn't do anything with it. It only uses the data that
    was stored in long_line, which is never replaced. That is very
    strange for real code, but as a test it can have its uses. That's why I
    asked.

    That sounds a bit weird to me; I would think such explicit memory
    management belongs in the C++ code instead of in the Python code, but I
    must admit that I know next to nothing about extending Python so I
    assume you are right.

    I'm afraid so. Sorry I can't help.

    One thing that helped me in the past to speed up input is using memory
    mapped I/O instead of stream I/O. But that was in C++ on Windows; I
    don't know if the same applies to Python on Linux.
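
    For reference, Python's standard library does provide memory-mapped
    files through the mmap module. A minimal sketch for current Python 3
    follows; count_lines_mmap() and the temporary demo file are
    hypothetical, and whether this is any faster than plain file
    iteration for the files in this thread is not established here and
    would need measuring:

```python
import mmap
import os
import tempfile


def count_lines_mmap(path):
    """Count lines by reading through a read-only memory map of the file."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            count = 0
            # readline() returns b"" once the end of the map is reached
            for _ in iter(mm.readline, b""):
                count += 1
            return count
        finally:
            mm.close()


# Small demonstration file with made-up content
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"1,2,3\n" * 1000)
n_lines = count_lines_mmap(path)
os.remove(path)
print(n_lines)
```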

     
    Roel Schroeven, Sep 25, 2007
    #8
  9. Roel Schroeven wrote:
    ... that should of course be len(data[input_fn]) ...
    ... and then that becomes:

    ./test_file0.txt: 0.290000 to read 33170000 bytes
    ./test_file1.txt: 0.231000 to read 33170000 bytes
    all done in 0.521000


     
    Roel Schroeven, Sep 25, 2007
    #9