Newbie completely confused


Jeroen Hegeman

Dear Pythoneers,

I'm moderately new to Python and it has already got me completely lost.

I've got a bunch of large (30MB) txt files containing one 'event' per
line. I open files after each other, read them line by line and from
each line build a 'data structure' of a main class (HugeClass)
containing some simple information as well as several instances of
some other classes.

No problem so far, but I noticed that the first file was always
faster than the others, whereas I would expect it to be slower, if
anything. Testing with two copies of the same file shows the same
behaviour.

Below is a (rather large, I'll explain) chunk of code. I ran this in
a directory with two test files called 'test_file0.txt' and
'test_file1.txt', each containing 10k lines of the same information
as the 'long_line' variable in the code. This shows the following
timing (consistently) for the little piece of code that reads all
lines from file:

...processing all 2 files found
--> 1/2: ./test_file0.txt
Now reading ...
DEBUG readLines A took 0.093 s
...took 8.85717201233 seconds
--> 2/2: ./test_file0.txt
Now reading ...
DEBUG readLines A took 3.917 s
...took 12.8725550175 seconds

So the first time around the file gets read in ~0.1 seconds, but the
second time around it needs almost four seconds! As far as I can see
this is related to 'something in memory being copied around', since if
I replace 'alternative 1' with 'alternative 2', basically
making sure that my classes are not used, the reading time the second
time around drops back to normal (= roughly what it is on the first pass).

I already want to apologise for the size of the code chunk below. I
know about 'minimal reproducible examples' and such, but I found out
that if I comment out the filling (and thus binding) of some of the
member variables in the lower-level classes, the problem (sometimes)
disappears as well. That also points to some magic happening in memory?

I probably mucked something up but I'm really lost as to where. Any
help would be appreciated.

The original problem showed up using Python 2.4.3 under linux (Fedora
Core 1).
Python 2.3.5 on OS X 10.4.10 (PPC) appears not to show this issue(?).

Thanks,
Jeroen

P.S. Any ideas on optimising the input to the classes would be
welcome too ;-)

Jeroen Hegeman
jeroen DOT hegeman AT gmail DOT com



===================Start of code chunk=========================
#!/usr/bin/env python

import time
import sys
import os
import gzip
import pdb

# One sample 'event' line; the newlines introduced by wrapping it for
# posting are stripped again below.
long_line = """\
1,31905,0,174501,46152419,2117961,143,-1.0000,51,2,-19.9139,42,-19.9140
,
6.6002,0,0,0,46713.1484,2,0.0000,-1,1.4203220606,0.3876158297,147.121017
4561,147.1284120973,-2,0.0000,-1,1.5887237787,-2.4011900425,-319.7776794
434,319.7906836817,4,21,0.0000,-1,-0.5672637224,2.2052443027,-43.2842369
080,43.3440905719,21,0.0000,-1,-0.8540721536,0.0770076364,-22.7033920288
,
22.7195827425,21,0.0000,-1,0.1623233557,0.5845987201,-28.0794525146,28.0
860084170,21,0.0000,-1,0.1943928897,-0.2195242196,-22.0666370392,22.0685
899391,6,0.0000,-1,-40.1810989380,-127.0743789673,-104.9231948853,239.74
36794163,-6,0.0000,-1,43.2013626099,125.0640945435,-67.7339172363,227.17
53587387,24,0.0000,-1,-57.9123306274,-17.3483123779,-71.8334121704,123.4
397648033,-24,0.0000,-1,84.0985488892,54.4542312622,-62.4525032043,144.5
299239704,5,0.0000,-1,17.7312316895,-109.7260665894,-33.0897827148,116.3
039146130,-5,0.0000,-1,-40.8971862793,70.6098632812,-5.2814140320,82.645
4347683,4,0.0000,-1,-6.2859884724,-17.9586020410,-58.9464384913,69.40294
68585,-3,0.0000,-1,-51.6263811588,0.6104701459,-12.8869901896,54.0368221
571,3,0.0000,-1,16.4690684490,48.0271777511,-51.7867884636,74.5327484701
,-4,0.0000,-1,67.6295298338,6.4269350171,-10.6658525467,69.9971834876,7,
7,1.0345464706e+01,-7.0800781250e+01,-2.0385742187e+01,7.5256346272e
+01,1.3148,0.0072,0.0072,1.3148,0.0072,0.0072,1.0255,1.0413,0.0,0.0,0.0,
0.0,-1.0,-4.2383,49.5276,13,0.1537,0.5156,0,0.9982,0.0034,1.0000,7,1,0.9
566,0.0062,1,0,2,1.2736,1,7.8407,1,0,2,1.2736,1,7.8407,0,0,-1.0,-1.0,5,1
,-2.4047853470e+01,4.0832519531e+01,-3.8452150822e+00,4.7851562559e
+01,1.3383,0.0051,0.0051,1.3383,0.0051,0.0051,0.9340,0.9541,0.0,0.0,0.0,
0.0,-1.0,-2.4609,21.3916,7,0.1166,0.5977,0,0.9999,0.0052,1.0000,9,1,0.99
47,0.0063,1,0,2,0.7735,1,74.7937,1,0,2,0.7735,1,74.7937,0,0,-1.0,-1.0,5,
1,-4.4067382812e+01,2.5634796619e+00,-1.1138916016e+01,4.6203614579e
+01,1.3533,0.0054,0.0054,1.3533,0.0054,0.0054,1.0486,1.0903,0.0,0.0,0.0,
0.0,-1.0,-3.9648,31.3733,13,0.1767,0.5508,100,0.9977,0.0040,1.0000,9,1,0
.
0000,0.4349,0,0,0,0.0000,0,-1000.0000,0,0,0,0.0000,0,-1000.0000,0,0,-1.0
,-1.0,0,1,3.7200927734e+01,2.7465817928e+00,-5.5847163200e
+00,3.7994386563e
+01,1.3634,0.0062,0.0062,1.6488,0.0385,0.0385,0.7141,0.9013,5.3986899118
e+00,6.6766492833e-01,-2.3780213181e-01,5.4460399892e
+00,0.5504,-3.1445,0.7776,9,0.1169,0.7734,0,0.9977,0.0040,1.0000,7,1,0.0
000,0.1099,0,0,0,0.0000,0,-1000.0000,0,0,0,0.0000,0,-1000.0000,1,-1,5.38
93,0.5459,4,1,1.2969970703e+01,3.3203125000e+01,-3.7231445312e
+01,5.2001951876e
+01,1.4414,0.0129,0.0129,1.4414,0.0129,0.0129,0.9019,0.7331,0.0,0.0,0.0,
0.0,-1.0,-10.0195,12.2034,17,0.1922,0.3633,0,0.9774,0.0248,1.0000,6,1,0.
0000,0.3523,0,0,0,0.0000,0,-1000.0000,0,0,0,0.0000,0,-1000.0000,0,0,-1.0
,-1.0,0,1,-1.6174327135e+00,-7.1411132812e+00,-1.8798828125e
+01,2.0202637222e
+01,1.7886,0.0352,0.0352,1.7886,0.0352,0.0352,1.8257,1.2368,0.0,0.0,0.0,
0.0,-1.0,-17.3438,45.6714,10,0.1529,0.5625,0,0.9898,0.0094,1.0000,3,1,-1
.
0000,10000.0000,0,0,0,-1.0000,0,-1.0000,0,0,0,-1.0000,0,-1.0000,0,0,-1.0
,-1.0,-6,0,-5.9204106331e+00,-3.4484868050e+00,-6.5307617187e
+00,9.6740722971e
+00,1.6782,0.0326,0.0326,1.6782,0.0326,0.0326,1.0000,1.0000,0.0,0.0,0.0,
0.0,-1.0,-9.4727,37.3401,13,0.2711,0.2344,100,0.9861,0.0045,1.0000,3,1,-
1.0000,10000.0000,0,0,0,-1.0000,0,-1.0000,0,0,0,-1.0000,0,-1.0000,0,0,-1
.0,-1.0,-6,0""".replace("\n", "")

########################################################################
###

class SmallClass:
    def __init__(self):
        return
    def input(self, line, c):
        self.item0 = int(line[c]); c += 1
        self.item1 = float(line[c]); c += 1
        self.item2 = int(line[c]); c += 1
        self.item3 = float(line[c]); c += 1
        self.item4 = float(line[c]); c += 1
        self.item5 = float(line[c]); c += 1
        self.item6 = float(line[c]); c += 1
        return c

########################################################################
###

class ModerateClass:
    def __init__(self):
        return
    def __del__(self):
        pass
        return
    def input(self, line, c):

        self.items = {}

        self.item0 = float(line[c]); c += 1

        unit1 = SmallClass()
        c = unit1.input(line, c)
        self.items[len(self.items)] = unit1
        unit2 = SmallClass()
        c = unit2.input(line, c)
        self.items[len(self.items)] = unit2

        units_chunk = []
        chunk_size = int(line[c])
        c += 1
        for i in xrange(chunk_size):
            unit = SmallClass()
            c = unit.input(line, c)
            units_chunk.append(unit)
        for i in xrange(10):
            unit = SmallClass()
            c = unit.input(line, c)
        return c

########################################################################
###

class LongClass:

    def __init__(self):
        return
    def clear(self):
        return
    def input(self, foo, c):
        self.item0 = float(foo[c]); c += 1
        self.item1 = float(foo[c]); c += 1
        self.item2 = float(foo[c]); c += 1
        self.item3 = float(foo[c]); c += 1
        self.item4 = float(foo[c]); c += 1
        self.item5 = float(foo[c]); c += 1
        self.item6 = float(foo[c]); c += 1
        self.item7 = float(foo[c]); c += 1
        self.item8 = float(foo[c]); c += 1
        self.item9 = float(foo[c]); c += 1
        self.item10 = float(foo[c]); c += 1
        self.item11 = float(foo[c]); c += 1
        self.item12 = float(foo[c]); c += 1
        self.item13 = float(foo[c]); c += 1
        self.item14 = float(foo[c]); c += 1
        self.item15 = float(foo[c]); c += 1
        self.item16 = float(foo[c]); c += 1
        self.item17 = float(foo[c]); c += 1
        self.item18 = float(foo[c]); c += 1
        self.item19 = int(foo[c]); c += 1
        self.item20 = float(foo[c]); c += 1
        self.item21 = float(foo[c]); c += 1
        self.item22 = int(foo[c]); c += 1
        self.item23 = float(foo[c]); c += 1
        self.item24 = float(foo[c]); c += 1
        self.item25 = float(foo[c]); c += 1
        self.item26 = int(foo[c]); c += 1
        self.item27 = bool(int(foo[c])); c += 1
        self.item28 = float(foo[c]); c += 1
        self.item29 = float(foo[c]); c += 1
        self.item30 = (foo[c] == "1"); c += 1
        self.item31 = (foo[c] == "1"); c += 1
        self.item32 = float(foo[c]); c += 1
        self.item33 = float(foo[c]); c += 1
        self.item34 = int(foo[c]); c += 1
        self.item35 = float(foo[c]); c += 1
        self.item36 = (foo[c] == "1"); c += 1
        self.item37 = (foo[c] == "1"); c += 1
        self.item38 = float(foo[c]); c += 1
        self.item39 = float(foo[c]); c += 1
        self.item40 = int(foo[c]); c += 1
        self.item41 = float(foo[c]); c += 1
        self.item42 = (foo[c] == "1"); c += 1
        self.item43 = float(foo[c]); c += 1
        self.item44 = float(foo[c]); c += 1
        self.item45 = float(foo[c]); c += 1
        self.item46 = int(foo[c]); c += 1
        self.item47 = bool(int(foo[c])); c += 1
        return c

########################################################################
###

class HugeClass:
    def __init__(self, line):
        self.clear()
        self.input(line)
        return
    def __del__(self):
        del self.B4v
        return
    def clear(self):
        self.long_classes = {}
        self.B4v = {}
        return
    def input(self, line):

        try:
            foo = line.strip().split(',')
            c = 0

            self.asciiVersion = float(foo[c])
            c += 1

            self.item0 = foo[c]; c += 1
            self.item1 = (self.item0 != "0")

            self.item2 = (foo[c] == "1"); c += 1

            self.item3 = int(foo[c]); c += 1
            self.item4 = int(foo[c]); c += 1
            self.item5 = int(foo[c]); c += 1
            self.item6 = int(foo[c]); c += 1
            self.item7 = float(foo[c]); c += 1

            self.item8 = foo[c]; c += 1
            bit_item = int(self.item8)
            self.item9 = bool(bit_item & 2048)
            self.item10 = bool(bit_item & 1024)
            self.item11 = bool(bit_item & 512)
            self.item12 = bool(bit_item & 256)
            self.item13 = bool(bit_item & 128)
            self.item14 = bool(bit_item & 64)
            self.item15 = bool(bit_item & 32)
            self.item16 = bool(bit_item & 16)
            self.item17 = bool(bit_item & 8)
            self.item18 = bool(bit_item & 4)
            self.item19 = bool(bit_item & 2)
            self.item20 = bool(bit_item & 1)

            self.item21 = int(foo[c]); c += 1
            self.item22 = float(foo[c]); c += 1
            self.item23 = int(foo[c]); c += 1
            self.item24 = float(foo[c]); c += 1

            self.item25 = float(foo[c]); c += 1

            self.item26 = foo[c]; c += 1
            self.item27 = int(foo[c]); c += 1
            self.item28 = int(foo[c]); c += 1

            self.item29 = ModerateClass()
            c = self.item29.input(foo, c)

            self.item30 = int(foo[c]); c += 1
            self.item31 = int(foo[c]); c += 1

            for i in xrange(self.item31):
                unit = LongClass()
                c = unit.input(foo, c)
                self.long_classes[len(self.long_classes)] = unit

            assert c == len(foo), "ERROR We did not read the whole line!!!"

        except (ValueError, IndexError), msg:
            print >> sys.stderr, \
                "ERROR Trouble reading line: `%(msg)s'" % vars()
            self.clear()
            return
        return

########################################################################
###

def readLines(f):
    DATA = []
    f.seek(0)

    time_a = time.time()
    for i in f:
        DATA.append(i)
    time_b = time.time()

    time_spent_reading = time_b - time_a
    print "DEBUG readLines took %.3f s" % time_spent_reading

    return DATA

########################################################################
###

def ReadClasses(filename):

    print 'Now reading ...'

    built_classes = {}

    # Read lines from file
    in_file = open(filename, 'r')
    LINES = readLines(in_file)
    in_file.close()

    # and interpret them.
    for i in LINES:
        ## This is alternative 1.
        built_classes[len(built_classes)] = HugeClass(long_line)
        ## The next line is alternative 2.
        ## built_classes[len(built_classes)] = long_line

    del LINES

    return

########################################################################
###

def ProcessList():

    input_files = ["./test_file0.txt",
                   "./test_file0.txt"]

    # Loop over all files that we found.
    nfiles = len(input_files)
    file_index = 0
    for i in input_files:
        print "--> %i/%i: %s" % (file_index + 1, nfiles, i)
        ReadClasses(i)
        file_index += 1

    return

########################################################################
###

if __name__ == "__main__":
    ProcessList()

    sys.exit(0)

########################################################################
###
 

John Machin

...processing all 2 files found
--> 1/2: ./test_file0.txt
Now reading ...
DEBUG readLines A took 0.093 s
...took 8.85717201233 seconds

Your code does NOT include any statements that could have produced the
above line of output -- IOW, you have not posted the code that you
actually ran. Your code is already needlessly monstrously large.
That's two strikes against anyone bothering to try to nut out what's
going wrong, if indeed anything is going wrong.

[snip]
The original problem showed up using Python 2.4.3 under linux (Fedora
Core 1).
Python 2.3.5 on OS X 10.4.10 (PPC) appears not to show this issue(?).

And Python 2.5.1 does what? Strike 3.
P.S. Any ideas on optimising the input to the classes would be
welcome too ;-)

1. What is the point of having a do-nothing __init__ method? I'd
suggest making the __init__ method do the "input".

2. See below

[snip]
class LongClass:

    def __init__(self):
        return
    def clear(self):
        return
    def input(self, foo, c):
        self.item0 = float(foo[c]); c += 1
        self.item1 = float(foo[c]); c += 1   [multiple snips ahead]
        self.item18 = float(foo[c]); c += 1
        self.item19 = int(foo[c]); c += 1
        self.item20 = float(foo[c]); c += 1
        self.item27 = bool(int(foo[c])); c += 1
        self.item30 = (foo[c] == "1"); c += 1
        self.item31 = (foo[c] == "1"); c += 1
        self.item47 = bool(int(foo[c])); c += 1
        return c

at global level:

converters = [float] * 48
cvlist = [
    (int, (19, 22, 26, 34, 40, 46)),
    (lambda z: bool(int(z)), (27, 47)),
    (lambda z: z == "1", (30, 31, 36, 37, 42)),
]
for func, indexes in cvlist:
    for x in indexes:
        converters[x] = func
enumerated_converters = list(enumerate(converters))

Then:

def input(self, foo, c):
    self.item = [func(foo[c+x]) for x, func in enumerated_converters]
    return c + 48

which requires you refer to obj.item[19] instead of obj.item19

If you *must* use item19 etc, then try this:

for x, func in enumerated_converters:
    setattr(self, "item%d" % x, func(foo[c+x]))

You could also (shock, horror) use meaningful names for the
attributes ... include a list of attribute names in the global stuff,
and put the relevant name in as the 2nd arg of setattr() instead of
itemxx.
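
For example (attribute names invented here, purely to show the shape of it):

field_specs = [
    ("event_id", int),
    ("weight",   float),
    ("is_good",  lambda z: z == "1"),
    # ... one (name, converter) pair per field, in line order ...
]

def input(self, foo, c):
    for name, func in field_specs:
        setattr(self, name, func(foo[c]))
        c += 1
    return c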

For handling the bit extraction stuff, either

(a) conversion functions have a 2nd arg which defaults to None and
whose usage depends on the function itself ... would be mask or bit
position (or could be e.g. a scale factor for implied-decimal-point
input)

or

(b) do a loop over the bit positions
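
A rough sketch of (b), assuming the twelve flags keep the item9..item20
names from the original HugeClass.input():

self.item8 = foo[c]; c += 1
bit_item = int(self.item8)
# bit 11 (2048) -> item9, bit 10 (1024) -> item10, ..., bit 0 (1) -> item20
for i in xrange(12):
    setattr(self, "item%d" % (9 + i), bool(bit_item & (1 << (11 - i))))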

HTH,
John
 

Roel Schroeven

Jeroen Hegeman schreef:
...processing all 2 files found
--> 1/2: ./test_file0.txt
Now reading ...
DEBUG readLines A took 0.093 s
...took 8.85717201233 seconds
--> 2/2: ./test_file0.txt
Now reading ...
DEBUG readLines A took 3.917 s
...took 12.8725550175 seconds

So the first time around the file gets read in in ~0.1 seconds, the
second time around it needs almost four seconds! As far as I can see
this is related to 'something in memory being copied around' since if
I replace the 'alternative 1' by the 'alternative 2', basically
making sure that my classes are not used, reading time the second
time around drops back to normal (= roughly what it is the first pass).

(First, I had to add timing code to ReadClasses: the code you posted
doesn't include it, and only shows timings for readLines.)

Your program uses quite a bit of memory. I guess it gets harder and
harder to allocate the required amounts of memory.

If I change this line in ReadClasses:

built_classes[len(built_classes)] = HugeClass(long_line)

to

dummy = HugeClass(long_line)

then both times the files are read and your data structures are built,
but after each run the data structure is freed. The result is that both
runs are equally fast.

Also, if I run the first version (without the dummy) on a computer with
a bit more memory (1 GiB), it seems there is no problem allocating
memory: both runs are equally fast.

I'm not sure how to speed things up here... you're doing much processing
on a lot of small chunks of data. I have a number of observations and
possible improvements though, and some might even speed things up a bit.

You read the files, but don't use the contents; instead you use
long_line over and over. I suppose you do that because this is a test,
not your actual code?

__init__() with nothing (or only return) in it is not useful; better to
just leave it out.


You have a number of return statements that don't do anything (i.e. they
return nothing (None actually) at the end of the function). A function
without return automatically returns None at the end, so it's better to
leave them out.

Similarly you don't need to call sys.exit(): the script will terminate
anyway if it reaches the end. Better leave it out.


LongClass.clear() doesn't do anything and isn't called anyway; leave it out.


ModerateClass.__del__() doesn't do anything either. I'm not sure how it
affects what happens if ModerateClass gets freed, but I suggest you
don't start messing with __del__() until you have more Python knowledge
and experience. I'm not sure why you think you need to implement that
method.
The same goes for HugeClass.__del__(). It does delete self.B4v, but the
default behavior will do that too. Again, I don't get why you want to
override the default behavior.


In a number of cases, you use a dict like this:

built_classes = {}
for i in LINES:
    built_classes[len(built_classes)] = ...

So you're using the indices 0, 1, 2, ... as the keys. That's not what
dictionaries are made for; lists are much better for that:

built_classes = []
for i in LINES:
    built_classes.append(...)


HugeClass.B4v isn't used, so you can safely remove it.


Your readLines() function reads a whole file into memory. If you're
working with large files, that's not such a good idea. It's better to
load one line at a time into memory and work on that. I would even
completely remove readLines() and restructure ReadClasses() like this:

def ReadClasses(filename):
    print 'Now reading ...'

    built_classes = []

    # Open file
    in_file = open(filename, 'r')

    # Read lines and interpret them.
    time_a = time.time()
    for i in in_file:
        ## This is alternative 1.
        built_classes.append(HugeClass(long_line))
        ## The next line is alternative 2.
        ## built_classes[len(built_classes)] = long_line

    in_file.close()
    time_b = time.time()
    print "DEBUG readClasses took %.3f s" % (time_b - time_a)

Personally I only use 'i' for integer indices (as in 'for i in
range(10)'); for other use I prefer more descriptive names:

for line in in_file: ...

But I guess that's up to personal preference. Also you used LINES to
store the file contents; the convention is that names with all capitals
are used for constants, not for things that change.


In ProcessList(), you keep the index in a separate variable. Python has
a trick so you don't have to do that yourself:

nfiles = len(input_files)
for file_index, i in enumerate(input_files):
    print "--> %i/%i: %s" % (file_index + 1, nfiles, i)
    ReadClasses(i)


Instead of item0, item1, ... , it's generally better to use a list, so
you can use item[0], item[1], ...


And finally, John Machin's suggestion looks like a good way to
restructure that long sequence of conversions and assignments in HugeClass.


--
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
-- Isaac Asimov

Roel Schroeven
 

Jeroen Hegeman

Thanks for the comments,
(First, I had to add timing code to ReadClasses: the code you posted
doesn't include it, and only shows timings for readLines.)

Your program uses quite a bit of memory. I guess it gets harder and
harder to allocate the required amounts of memory.

Well, I guess there could be something in that, but why is there a
significant increase after the first time? And after that, single-trip
time pretty much flattens out. No more obvious increases.
If I change this line in ReadClasses:

built_classes[len(built_classes)] = HugeClass(long_line)

to

dummy = HugeClass(long_line)

then both times the files are read and your data structures are built,
but after each run the data structure is freed. The result is that both
runs are equally fast.

Isn't the 'del LINES' supposed to achieve the same thing? And
really, reading 30MB files should not be such a problem, right? (I'm
also running with 1GB of RAM.)
I'm not sure how to speed things up here... you're doing much processing
on a lot of small chunks of data. I have a number of observations and
possible improvements though, and some might even speed things up a bit.

Cool thanks, let's go over them.
You read the files, but don't use the contents; instead you use
long_line over and over. I suppose you do that because this is a test,
not your actual code?

Yeah ;-) (Do I notice a lack of trust in the responses I get? Should
I not mention 'newbie'?)

Let's get a couple of things out of the way:
- I do know about meaningful variable names and case-conventions,
but ... First of all I also have to live with inherited code (I don't
like people shouting in their code either), and secondly (all the
itemx) most of these members normally _have_ descriptive names but
I'm not supposed to copy-paste the original code to any newsgroups.
- I also know that a plain 'return' in python does not do anything
but I happen to like them. Same holds for the sys.exit() call.
- The __init__ methods normally actually do something: they
initialise some member variables to meaningful values (by calling the
clear() method, actually).
- The clear() method normally brings objects back into a well-defined
'empty' state.
- The __del__ methods are actually needed in this case (well, in the
_real_ code anyway). The python code loads a module written in C++
and some of the member variables actually point to C++ objects
created dynamically, so one actually has to call their destructors
before unbinding the python var.
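
Roughly the pattern looks like this ('cppmodule' and 'release()' are
made-up stand-ins for the real extension API, which I can't post):

class Holder:
    def __init__(self):
        self.native = cppmodule.new_object()   # C++ object on the C++ heap
    def __del__(self):
        self.native.release()                  # run the C++ destructor first
        del self.native                        # then unbind the Python name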

I tried to get things down as small as possible, but when I found
out that the size of the classes seems to contribute to the issue
(removing enough member variables will bring you to a point where all
of a sudden the speed increases by a factor of ten; there seems to be some
breakpoint depending on the size of the classes), I could not simply
remove all members but had to give them funky names. I kept the main
structure of things, though, to see if that would elicit comments.
(And it did...)
In a number of cases, you use a dict like this:

built_classes = {}
for i in LINES:
built_classes[len(built_classes)] = ...

So you're using the indices 0, 1, 2, ... as the keys. That's not what
dictionaries are made for; lists are much better for that:

built_classes = []
for i in LINES:
built_classes.append(...)

Yeah, I inherited that part...
Your readLines() function reads a whole file into memory. If you're
working with large files, that's not such a good idea. It's better to
load one line at a time into memory and work on that. I would even
completely remove readLines() and restructure ReadClasses() like this:

Actually, part of what I removed was the real reason why readLines()
is there at all: it reads files in blocks of (at most) some_number
lines, and keeps track of the line offset in the file. I kept this
structure hoping that someone would point out something obvious like
some internal buffer going out of scope or whatever.
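
A simplified sketch of what the real readLines() does (not the actual
code; names and details differ):

def readLines(f, offset, max_lines=10000):
    # Pick up where the previous block ended.
    f.seek(offset)
    lines = []
    for i in xrange(max_lines):
        line = f.readline()
        if not line:
            break
        lines.append(line)
    # Return the block plus the offset to resume from next time.
    return lines, f.tell()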

All right, thanks for the tips. I guess the issue itself is still
open, though.

Cheers,
Jeroen

Jeroen Hegeman
jeroen DOT hegeman AT gmail DOT com

WARNING: This message may contain classified information. Immediately
burn this message after reading.
 

Jeroen Hegeman

Your code does NOT include any statements that could have produced the
above line of output -- IOW, you have not posted the code that you
actually ran.

Oh my, I must have cleaned it up a bit too much, hoping that people
would focus on the issue instead of the formatting of the output
strings! Did you miss your morning coffee???
Your code is already needlessly monstrously large.
Which I realised and apologised for beforehand.
And Python 2.5.1 does what? Strike 3.

Hmm, I must have missed where it said that you can only ask for help
if you're using the latest version... In case you're wondering, 2.5.1
is not _really_ as widespread as most of the older versions.
For handling the bit extraction stuff, either [snip]
(b) do a loop over the bit positions

Now that sounds more useful. I'll give that a try.

Thanks,
Jeroen

Jeroen Hegeman
jeroen DOT hegeman AT gmail DOT com

WARNING: This message may contain classified information. Immediately
burn this message after reading.
 

Istvan Albert

Two comments,
...
self.item3 = float(foo[c]); c+=1
self.item4 = float(foo[c]); c+=1
self.item5 = float(foo[c]); c+=1
self.item6 = float(foo[c]); c+=1
...

this here (and your code in general) is mind-boggling, and not in a
good way,

as for your original question, I don't think that reading in files of
the size you mention can cause any substantial problems; I think the
problem is somewhere else,

you can run the code below to see that the read times are unaffected
by the order of processing

----------

import timeit

# make a big file
NUM = 10**5
fp = open('bigfile.txt', 'wt')
longline = ' ABC ' * 60 + '\n'
for count in xrange(NUM):
    fp.write(longline)
fp.close()

setup1 = """
def readLines():
    data = []
    for line in file('bigfile.txt'):
        data.append(line)
    return data
"""

stmt1 = """
data = readLines()
"""

stmt2 = """
data = readLines()
data = readLines()
"""

stmt3 = """
data = file('bigfile.txt').readlines()
"""

def run(setup, stmt, N=5):
    t = timeit.Timer(stmt=stmt, setup=setup)
    msec = 1000 * t.timeit(number=N) / N
    print "%f msec/pass" % msec

if __name__ == '__main__':
    for stmt in (stmt1, stmt2, stmt3):
        run(setup=setup1, stmt=stmt)
 

John Machin

Oh my, I must have cleaned it up a bit too much, hoping that people
would focus on the issue instead of the formatting of the output
strings! Did you miss your morning coffee???

The difference was not a formatting difference; it was the complete
absence of a statement, raising the question of what other non-obvious
differences there might be.

You miss the point: if it is obvious that the posted code did not
produce the posted output (common when newbies are thrashing around
trying to solve a problem), some of the audience may not bother trying
to help with the main issue -- they may attempt to help with side
issues (as I did with the fugly code bloat) or just ignore you
altogether.
Which I realised and apologised for beforehand.

An apology does not change the fact that the code was needlessly large
(AND needed careful post-linefolding reformatting just to make it
runnable) and so some may not have bothered to read it.
Hmm, I must have missed where it said that you can only ask for help
if you're using the latest version...

You missed the point again: that your problem may be fixed in a later
version.
In case you're wondering, 2.5.1
is not _really_ as widespread as most of the older versions.

I wasn't wondering. I know. I maintain a package (xlrd) which works on
Python 2.5 all the way back to 2.1. It occasionally has possibly
similar "second iteration goes funny" issues (e.g. when reading 120MB
Excel spreadsheet files one after the other). You mention that
removing some attributes from a class may make your code stop
exhibiting cliff-face behaviour. If you can produce two versions of
your code that actually demonstrate the abrupt change, I'd be quite
interested in digging into it, to our possible mutual benefit.
For handling the bit extraction stuff, either [snip]
(b) do a loop over the bit positions

Now that sounds more useful. I'll give that a try.

I'm glad you found something possibly more useful in my posting :)

Cheers,
John
 

Roel Schroeven

Jeroen Hegeman schreef:
Thanks for the comments,


Well, I guess there could be something in that, but why is there a
significant increase after the first time? And after that, single-trip
time pretty much flattens out. No more obvious increases.

Sorry, I have no idea.
If I change this line in ReadClasses:

built_classes[len(built_classes)] = HugeClass(long_line)

to

dummy = HugeClass(long_line)

then both times the files are read and your data structures are built,
but after each run the data structure is freed. The result is that both
runs are equally fast.

Isn't the 'del LINES' supposed to achieve the same thing? And
really, reading 30MB files should not be such a problem, right? (I'm
also running with 1GB of RAM.)

'del LINES' deletes the lines that are read from the file, but not all
of your data structures that you created out of them.
Now, indeed, reading 30 MB files should not be a problem. And I am
confident that just reading the data is not a problem. To make sure I
created a simple test:

import time

input_files = ["./test_file0.txt", "./test_file1.txt"]

total_start = time.time()
data = {}
for input_fn in input_files:
    file_start = time.time()
    f = file(input_fn, 'r')
    data[input_fn] = f.read()
    f.close()
    file_done = time.time()
    print '%s: %f to read %d bytes' % (input_fn, file_done - file_start, len(data))
total_done = time.time()
print 'all done in %f' % (total_done - total_start)


When I run that with test_file0.txt and test_file1.txt as you described
(each 30 MB), I get this output:

./test_file0.txt: 0.260000 to read 1 bytes
./test_file1.txt: 0.251000 to read 2 bytes
all done in 0.521000

Therefore I think the problem is not in reading the data, but in
processing it and creating the data structures.
Yeah ;-) (Do I notice a lack of trust in the responses I get? Should
I not mention 'newbie'?)

I didn't mean to attack you; it's just that the program reads 30 MB of
data, twice, but doesn't do anything with it. It only uses the data that
was stored in long_line, which is never replaced. That is very
strange for real code, but as a test it can have its uses. That's why I
asked.
Let's get a couple of things out of the way:
- I do know about meaningful variable names and case-conventions,
but ... First of all I also have to live with inherited code (I don't
like people shouting in their code either), and secondly (all the
itemx) most of these members normally _have_ descriptive names but
I'm not supposed to copy-paste the original code to any newsgroups.
Ok.

- I also know that a plain 'return' in python does not do anything
but I happen to like them. Same holds for the sys.exit() call.
Ok.

- The __init__ methods normally actually do something: they
initialise some member variables to meaningful values (by calling the
clear() method, actually).
- The __clear__ method normally brings objects back into a well-
defined 'empty' state.
- The __del__ methods are actually needed in this case (well, in the
_real_ code anyway). The python code loads a module written in C++
and some of the member variables actually point to C++ objects
created dynamically, so one actually has to call their destructors
before unbinding the python var.

That sounds a bit weird to me; I would think such explicit memory
management belongs in the C++ code instead of in the Python code, but I
must admit that I know next to nothing about extending Python so I
assume you are right.
All right, thanks for the tips. I guess the issue itself is still
open, though.

I'm afraid so. Sorry I can't help.

One thing that helped me in the past to speed up input is using memory
mapped I/O instead of stream I/O. But that was in C++ on Windows; I
don't know if the same applies to Python on Linux.
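
Python does have a mmap module in the standard library; I haven't measured
whether it helps for this case, but the basic pattern would be something like:

import mmap

f = open('test_file0.txt', 'rb')
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
line = m.readline()
while line:
    # process the line here, e.g. HugeClass(line)
    line = m.readline()
m.close()
f.close()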

--
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
-- Isaac Asimov

Roel Schroeven
 

Roel Schroeven

Roel Schroeven schreef:
import time

input_files = ["./test_file0.txt", "./test_file1.txt"]

total_start = time.time()
data = {}
for input_fn in input_files:
    file_start = time.time()
    f = file(input_fn, 'r')
    data[input_fn] = f.read()
    f.close()
    file_done = time.time()
    print '%s: %f to read %d bytes' % (input_fn, file_done - file_start, len(data))

... that should of course be len(data[input_fn]) ...

total_done = time.time()
print 'all done in %f' % (total_done - total_start)


When I run that with test_file0.txt and test_file1.txt as you described
(each 30 MB), I get this output:

./test_file0.txt: 0.260000 to read 1 bytes
./test_file1.txt: 0.251000 to read 2 bytes
all done in 0.521000

... and then that becomes:

./test_file0.txt: 0.290000 to read 33170000 bytes
./test_file1.txt: 0.231000 to read 33170000 bytes
all done in 0.521000


--
The saddest aspect of life right now is that science gathers knowledge
faster than society gathers wisdom.
-- Isaac Asimov

Roel Schroeven
 
