how to optimize object creation/reading from file?

perfreem

hi,

i am doing a series of very simple string operations on lines i am
reading from a large file (~15 million lines). i store the result of
these operations in a simple instance of a class, and then put it
inside of a hash table. i found that this is unusually slow... for
example:

from collections import defaultdict
import time

class myclass(object):
    __slots__ = ("a", "b", "c", "d")
    def __init__(self, a, b, c, d):
        self.a = a
        self.b = b
        self.c = c
        self.d = d
    def __str__(self):
        return "%s_%s_%s_%s" % (self.a, self.b, self.c, self.d)
    def __hash__(self):
        return hash((self.a, self.b, self.c, self.d))
    def __eq__(self, other):
        return (self.a == other.a and \
                self.b == other.b and \
                self.c == other.c and \
                self.d == other.d)
    __repr__ = __str__

n = 15000000
table = defaultdict(int)
t1 = time.time()
for k in range(1, n):
    myobj = myclass('a' + str(k), 'b', 'c', 'd')
    table[myobj] = 1
t2 = time.time()
print "time: ", float((t2-t1)/60.0)

this takes a very long time to run: 11 minutes! for the sake of the
example i am not reading anything from a file here, but in my real code i
do. also, i do 'a' + str(k) here, but in my real code it is some simple
string operation on the line i read from the file. however, i found
that the above code shows the real bottleneck, since reading my file
into memory (using readlines()) takes only about 4 seconds. i then
have to iterate over these lines, but i still think that is more
efficient than the 'for line in file' approach, which is even slower.

in the above code, is there a way to optimize the creation of the class
instances? i am using defaultdicts instead of ordinary dicts, so i don't
know how else to optimize that part of the code. is there a way to
perhaps optimize the way the class is written? if it takes only a few
seconds to read 15 million lines into memory, it doesn't make sense to me
that making them into simple objects along the way would take that much
more...
 
Bruno Desthuilliers

(e-mail address removed) wrote:
hi,

i am doing a series of very simple string operations on lines i am
reading from a large file (~15 million lines). i store the result of
these operations in a simple instance of a class, and then put it
inside of a hash table. i found that this is unusually slow... for
example:

class myclass(object):
    __slots__ = ("a", "b", "c", "d")
    def __init__(self, a, b, c, d):
        self.a = a
        self.b = b
        self.c = c
        self.d = d
    def __str__(self):
        return "%s_%s_%s_%s" % (self.a, self.b, self.c, self.d)
    def __hash__(self):
        return hash((self.a, self.b, self.c, self.d))
    def __eq__(self, other):
        return (self.a == other.a and \
                self.b == other.b and \
                self.c == other.c and \
                self.d == other.d)
    __repr__ = __str__


If your class really looks like that, a tuple would be enough.
n = 15000000
table = defaultdict(int)
t1 = time.time()
for k in range(1, n):

hint: use xrange instead.
    myobj = myclass('a' + str(k), 'b', 'c', 'd')
    table[myobj] = 1

hint: if all you want is to ensure uniqueness, use a set instead.
t2 = time.time()
print "time: ", float((t2-t1)/60.0)

hint: use timeit instead.
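
putting those three hints together, something like this rough sketch (untested, n scaled down and the function name made up just for illustration):

from timeit import Timer

def build_keys(n=1000000):
    # a plain set is enough when the dict values are never used
    seen = set()
    for k in xrange(1, n):  # xrange avoids materialising a huge list of ints
        seen.add(('a' + str(k), 'b', 'c', 'd'))
    return seen

# run the build once and print the elapsed time in seconds
print Timer("build_keys()", "from __main__ import build_keys").timeit(number=1)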
this takes a very long time to run: 11 minutes! for the sake of the
example i am not reading anything from a file here, but in my real code i
do. also, i do 'a' + str(k) here, but in my real code it is some simple
string operation on the line i read from the file. however, i found
that the above code shows the real bottleneck, since reading my file
into memory (using readlines()) takes only about 4 seconds. i then
have to iterate over these lines, but i still think that is more
efficient than the 'for line in file' approach, which is even slower.

iterating over the file, while indeed a bit slower on a per-line basis,
avoids useless memory consumption which can lead to disk swapping - so for
"huge" files, it might still be better with respect to overall performance.
in the above code, is there a way to optimize the creation of the class
instances? i am using defaultdicts instead of ordinary dicts, so i don't
know how else to optimize that part of the code. is there a way to
perhaps optimize the way the class is written? if it takes only a few
seconds to read 15 million lines into memory, it doesn't make sense to me
that making them into simple objects along the way would take that much
more...

Did you benchmark the creation of a list of 15,000,000 ints? ;-)

But anyway, creating 15,000,000 instances (which is not a small number)
of your class takes quite a while - about 23.5 seconds on my
(already heavily loaded) machine. Building the same number of tuples
takes only about 2.5 seconds - that is, almost 10 times less. FWIW,
tuples have all the useful characteristics of your class here
(hashing and comparison behave the same way).
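
IOW, your benchmark with a tuple swapped in for the class looks like this (quick sketch, timings will of course vary):

from collections import defaultdict
import time

n = 15000000
table = defaultdict(int)
t1 = time.time()
for k in xrange(1, n):
    # a tuple of the four fields hashes and compares element-wise,
    # which is exactly what your __hash__ and __eq__ do by hand
    myobj = ('a' + str(k), 'b', 'c', 'd')
    table[myobj] = 1
t2 = time.time()
print "time: ", (t2 - t1) / 60.0

And if you later want attribute access back, collections.namedtuple (new in Python 2.6) gives you named fields while keeping the tuple's hashing and comparison behaviour.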

My 2 cents...
 
perfreem

thanks for your insightful reply - changing to tuples made a big
difference!
 
