perfreem
hi,

I am doing a series of very simple string operations on lines I am reading from a large file (~15 million lines). I store the result of these operations in a simple instance of a class, and then put it inside a hash table. I found that this is unusually slow... for example:
from collections import defaultdict
import time

class myclass(object):
    __slots__ = ("a", "b", "c", "d")

    def __init__(self, a, b, c, d):
        self.a = a
        self.b = b
        self.c = c
        self.d = d

    def __str__(self):
        return "%s_%s_%s_%s" % (self.a, self.b, self.c, self.d)

    def __hash__(self):
        return hash((self.a, self.b, self.c, self.d))

    def __eq__(self, other):
        return (self.a == other.a and
                self.b == other.b and
                self.c == other.c and
                self.d == other.d)

    __repr__ = __str__

n = 15000000
table = defaultdict(int)
t1 = time.time()
for k in range(1, n):
    myobj = myclass('a' + str(k), 'b', 'c', 'd')
    table[myobj] = 1
t2 = time.time()
print "time: ", float((t2 - t1) / 60.0)
This takes a very long time to run: 11 minutes! For the sake of the example I am not reading anything from a file here, but in my real code I do. Also, I do 'a' + str(k), but in my real code this is some simple string operation on the line I read from the file. However, I found that the above code shows the real bottleneck, since reading my file into memory (using readlines()) takes only about 4 seconds. I then have to iterate over these lines, but I still think that is more efficient than the 'for line in file' approach, which is even slower.
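For reference, the two reading approaches I am comparing look roughly like this (a self-contained sketch that generates its own small sample file, since my real 15-million-line file is obviously not included; the filename and line count here are made up for illustration):

```python
import os
import time

# create a small sample file for illustration (my real file has ~15M lines)
filename = "lines_demo.txt"
with open(filename, "w") as f:
    for k in range(100000):
        f.write("line %d\n" % k)

# approach 1: slurp everything into a list with readlines(), then iterate
t1 = time.time()
with open(filename) as f:
    lines = f.readlines()
count_a = 0
for line in lines:
    count_a += 1  # the real string operations would go here
print("readlines: %.3f s" % (time.time() - t1))

# approach 2: iterate over the file object directly ('for line in file')
t1 = time.time()
count_b = 0
with open(filename) as f:
    for line in f:
        count_b += 1  # the real string operations would go here
print("iterate:   %.3f s" % (time.time() - t1))

os.remove(filename)
```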
In the above code, is there a way to optimize the creation of the class instances? I am using defaultdicts instead of ordinary dicts, so I don't know how else to optimize that part of the code. Is there a way to perhaps optimize the way the class is written? If it takes only a few seconds to read 15 million lines into memory, it doesn't make sense to me that turning them into simple objects along the way would take that much more...
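For comparison, here is a sketch of the same loop using a plain tuple as the dict key instead of a class instance (scaled down to a smaller n so it runs quickly; whether this helps in my real code is exactly what I am asking about). Tuples are hashable out of the box, so each iteration skips the Python-level __init__ and __hash__ calls that myclass incurs:

```python
import time
from collections import defaultdict

n = 100000  # scaled down from 15000000 so this sketch runs quickly

table = defaultdict(int)
t1 = time.time()
for k in range(1, n):
    # same four fields as myclass, but stored as a plain tuple key --
    # no per-object __init__ or Python-level __hash__ call needed
    key = ('a' + str(k), 'b', 'c', 'd')
    table[key] = 1
t2 = time.time()
print("tuple keys: %.3f s for %d entries" % (t2 - t1, len(table)))
```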