how to optimize object creation/reading from file?

perfreem

hi,

i am doing a series of very simple string operations on lines i am
reading from a large file (~15 million lines). i store the result of
these operations in a simple instance of a class, and then put it
inside of a hash table. i found that this is unusually slow... for
example:

from collections import defaultdict
import time

class myclass(object):
    __slots__ = ("a", "b", "c", "d")
    def __init__(self, a, b, c, d):
        self.a = a
        self.b = b
        self.c = c
        self.d = d
    def __str__(self):
        return "%s_%s_%s_%s" % (self.a, self.b, self.c, self.d)
    def __hash__(self):
        return hash((self.a, self.b, self.c, self.d))
    def __eq__(self, other):
        return (self.a == other.a and \
                self.b == other.b and \
                self.c == other.c and \
                self.d == other.d)
    __repr__ = __str__

n = 15000000
table = defaultdict(int)
t1 = time.time()
for k in range(1, n):
    myobj = myclass('a' + str(k), 'b', 'c', 'd')
    table[myobj] = 1
t2 = time.time()
print "time: ", float((t2-t1)/60.0)

this takes a very long time to run: 11 minutes! for the sake of the
example i am not reading anything from a file here, but in my real code i
do. also, i do 'a' + str(k) here, but in my real code it is some simple
string operation on the line i read from the file. however, i found
that the above code shows the real bottleneck, since reading my file
into memory (using readlines()) takes only about 4 seconds. i then
have to iterate over these lines, but i still think that is more
efficient than the 'for line in file' approach, which is even slower.

in the above code, is there a way to optimize the creation of the class
instances? i am using defaultdicts instead of ordinary dicts, so i don't
know how else to optimize that part of the code. is there a way to
perhaps optimize the way the class is written? if it takes only a few
seconds to read 15 million lines into memory, it doesn't make sense to me
that making them into simple objects along the way would take that much
more...
 
Bruno Desthuilliers

(e-mail address removed) wrote:
hi,

i am doing a series of very simple string operations on lines i am
reading from a large file (~15 million lines). i store the result of
these operations in a simple instance of a class, and then put it
inside of a hash table. i found that this is unusually slow... for
example:

class myclass(object):
    __slots__ = ("a", "b", "c", "d")
    def __init__(self, a, b, c, d):
        self.a = a
        self.b = b
        self.c = c
        self.d = d
    def __str__(self):
        return "%s_%s_%s_%s" % (self.a, self.b, self.c, self.d)
    def __hash__(self):
        return hash((self.a, self.b, self.c, self.d))
    def __eq__(self, other):
        return (self.a == other.a and \
                self.b == other.b and \
                self.c == other.c and \
                self.d == other.d)
    __repr__ = __str__


If your class really looks like that, a tuple would be enough.
n = 15000000
table = defaultdict(int)
t1 = time.time()
for k in range(1, n):

hint: use xrange instead.
    myobj = myclass('a' + str(k), 'b', 'c', 'd')
    table[myobj] = 1

hint: if all you want is to ensure uniqueness, use a set instead.
t2 = time.time()
print "time: ", float((t2-t1)/60.0)

hint: use timeit instead.
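
putting those three hints together, something like this rough sketch (untested, n scaled down and the function name made up just for illustration):

from timeit import Timer

def build_keys(n=1000000):
    # a plain set is enough when the dict values are never used
    seen = set()
    for k in xrange(1, n):  # xrange avoids materialising a huge list of ints
        seen.add(('a' + str(k), 'b', 'c', 'd'))
    return seen

# run the build once and print the elapsed time in seconds
print Timer("build_keys()", "from __main__ import build_keys").timeit(number=1)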
this takes a very long time to run: 11 minutes! for the sake of the
example i am not reading anything from a file here, but in my real code i
do. also, i do 'a' + str(k) here, but in my real code it is some simple
string operation on the line i read from the file. however, i found
that the above code shows the real bottleneck, since reading my file
into memory (using readlines()) takes only about 4 seconds. i then
have to iterate over these lines, but i still think that is more
efficient than the 'for line in file' approach, which is even slower.

iterating over the file, while indeed a bit slower on a per-line basis,
avoids useless memory consumption which can lead to disk swapping - so for
"huge" files, it might still be better with respect to overall performance.
in the above code, is there a way to optimize the creation of the class
instances? i am using defaultdicts instead of ordinary dicts, so i don't
know how else to optimize that part of the code. is there a way to
perhaps optimize the way the class is written? if it takes only a few
seconds to read 15 million lines into memory, it doesn't make sense to me
that making them into simple objects along the way would take that much
more...

Did you benchmark the creation of a list of 15,000,000 ints? ;-)

But anyway, creating 15,000,000 instances (which is not a small number)
of your class takes quite a while - about 23.5 seconds on my
(already heavily loaded) machine. Building the same number of tuples
takes only about 2.5 seconds - that is, almost 10 times less. FWIW,
tuples have all the useful characteristics of your class here
(hashing and comparison behave the same way).
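
IOW, your benchmark with a tuple swapped in for the class looks like this (quick sketch, timings will of course vary):

from collections import defaultdict
import time

n = 15000000
table = defaultdict(int)
t1 = time.time()
for k in xrange(1, n):
    # a tuple of the four fields hashes and compares element-wise,
    # which is exactly what your __hash__ and __eq__ do by hand
    myobj = ('a' + str(k), 'b', 'c', 'd')
    table[myobj] = 1
t2 = time.time()
print "time: ", (t2 - t1) / 60.0

And if you later want attribute access back, collections.namedtuple (new in Python 2.6) gives you named fields while keeping the tuple's hashing and comparison behaviour.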

My 2 cents...
 
perfreem

thanks for your insightful reply - changing to tuples made a big
difference!
 
