Slow down while creating a big list and iterating over it

marc magrans de abril · Jan 30, 2010

Dear colleagues,

I was doing a small program to classify log files for a cluster of
PCs, I just wanted to simplify a quite repetitive task in order to
find errors and so.

My first naive implementation was something like:
patterns = []
while(logs):
pattern = logs[0]
new_logs = [l for l in logs if dist(pattern,l)>THERESHOLD]
entry = (len(logs)-len(new_logs),pattern)
patterns.append(entry)
logs = new_logs

Where dist(...) is the levenshtein distance (i.e. edit distance) and
logs is something like 1.5M logs (700 MB file). I thought that python
will be an easy choice although not really fast..

I was not surprised when the first iteration of the while loop was
taking ~10min. I thought "not bad, let's how much it takes". However,
it seemed that the second iteration never finished.

My surprise was big when I added a print instead of the list
comprehension:
new_logs=[]
for count,l in enumerate(logs):
print count
if dist(pattern,l)>THERESHOLD:
new_logs.append(l)

The surprise was that the displayed counter was running ~10 times
slower on the second iteration of the while loop.

I am a little lost. Anyone knows the reson of this behavior? How
should I write a program that deals with large data sets in python?

Thanks a lot!
marc magrans de abril

Alf P. Steinbach · Jan 30, 2010

* marc magrans de abril:

Dear colleagues,

I was doing a small program to classify log files for a cluster of
PCs, I just wanted to simplify a quite repetitive task in order to
find errors and so.

My first naive implementation was something like:
patterns = []
while(logs):
pattern = logs[0]
new_logs = [l for l in logs if dist(pattern,l)>THERESHOLD]
entry = (len(logs)-len(new_logs),pattern)
patterns.append(entry)
logs = new_logs

Where dist(...) is the levenshtein distance (i.e. edit distance) and
logs is something like 1.5M logs (700 MB file). I thought that python
will be an easy choice although not really fast..

I was not surprised when the first iteration of the while loop was
taking ~10min. I thought "not bad, let's how much it takes". However,
it seemed that the second iteration never finished.

My surprise was big when I added a print instead of the list
comprehension:
new_logs=[]
for count,l in enumerate(logs):
print count
if dist(pattern,l)>THERESHOLD:
new_logs.append(l)

The surprise was that the displayed counter was running ~10 times
slower on the second iteration of the while loop.

I am a little lost. Anyone knows the reson of this behavior?

It's on line 42 of your program.

That is, it's in the dist function.
Evidently it doesn't like a more complex 'pattern'.

How should I write a program that deals with large data sets in python?

As in any other language. Try to avoid repeating the same computations. Try to
make the data fit the computational task.

Cheers & hth.,

- Alf

MRAB · Jan 30, 2010

Alf said:
* marc magrans de abril:

Dear colleagues,

I was doing a small program to classify log files for a cluster of
PCs, I just wanted to simplify a quite repetitive task in order to
find errors and so.

My first naive implementation was something like:
patterns = []
while(logs):
pattern = logs[0]
new_logs = [l for l in logs if dist(pattern,l)>THERESHOLD]
entry = (len(logs)-len(new_logs),pattern)
patterns.append(entry)
logs = new_logs

Where dist(...) is the levenshtein distance (i.e. edit distance) and
logs is something like 1.5M logs (700 MB file). I thought that python
will be an easy choice although not really fast..

I was not surprised when the first iteration of the while loop was
taking ~10min. I thought "not bad, let's how much it takes". However,
it seemed that the second iteration never finished.

My surprise was big when I added a print instead of the list
comprehension:
new_logs=[]
for count,l in enumerate(logs):
print count
if dist(pattern,l)>THERESHOLD:
new_logs.append(l)

The surprise was that the displayed counter was running ~10 times
slower on the second iteration of the while loop.

I am a little lost. Anyone knows the reson of this behavior?

Click to expand...

It's on line 42 of your program. That is, it's in the dist function.
Evidently it doesn't like a more complex 'pattern'.

Find out which pattern is being used on the second iteration and then
try it on the first iteration. Is it just as slow? If so, then it's down
to the length/complexity of that pattern - a much longer/more complex
pattern might take much longer when computing the distance.

As in any other language. Try to avoid repeating the same computations.
Try to make the data fit the computational task.

True. Basically, you're computing the distance between every pair of
logs!

Aliaksandr Abushkevich · Jan 31, 2010

Maybe it is a good idea to use Disco (http://discoproject.org/) to
process your data.

Yours faithfully,
Alexander Abushkevich

marc magrans de abril · Jan 31, 2010

Find out which pattern is being used on the second iteration and then try it on the first iteration. Is it just as slow?
You were right, the second pattern was 1891 bytes but the first was
just 142

I will need to put more thought than I expect in the "small script".

marc magrans de abril · Jan 31, 2010

Hi!

....I have found a good enough solution, although it only works if the
number of patterns (clusters) is not very big:
def classify(f):
THERESHOLD=0.1

patterns={}
for l in enumerate(f):
found = False
for p,c in patterns.items():
if dist(l,p) < THERESHOLD:
found=True
patterns[p] = c +1

if not found:
patterns[l] = 1

return patterns

This algorithm is O(n*np*m^2). Where n is the number of logs, np the
number of patterns, and m is the log length (i.e. m^2 is the distance
cost). So it's way better O(n^2*m^2) and I can run it for some hours
to get back the results.

I wonder if there is a single threaded/process clustering algorithm
than runs in O(n)?

Cheers,
marc

MRAB · Jan 31, 2010

marc said:
Hi!

...I have found a good enough solution, although it only works if the
number of patterns (clusters) is not very big:
def classify(f):
THERESHOLD=0.1

patterns={}
for l in enumerate(f):
found = False
for p,c in patterns.items():
if dist(l,p) < THERESHOLD:
found=True
patterns[p] = c +1

if not found:
patterns[l] = 1

return patterns

This algorithm is O(n*np*m^2). Where n is the number of logs, np the
number of patterns, and m is the log length (i.e. m^2 is the distance
cost). So it's way better O(n^2*m^2) and I can run it for some hours
to get back the results.

I wonder if there is a single threaded/process clustering algorithm
than runs in O(n)?

Your original code used the first entry in the remaining logs for each
pattern, but your new code stores the patterns in a dict, which is
unordered, so you might get different results.

But that doesn't matter, because your new code increments the count when
it has found a match, and then continues looking, so it might match and
increment more than once.

Finally, your original code treated it as a match if distance <=
threshold but your new code treats it as a match if distance <
threshold.

patterns = []
for x in logs:
for index, (pat, count) in enumerate(patterns):
if dist(pat, x) <= THRESHOLD:
patterns[index] = pat, count + 1
break
else:
# Didn't break out of the loop, therefore no match.
patterns.append((x, 1))

iterating over a list as if it were a circular list	3	Mar 7, 2013
String and list error while running a Markov Chain	1	Aug 26, 2020
iterating over a list as if it were a circular list	1	Mar 7, 2013
Deleting from a list while iterating	9	Dec 3, 2006
modifying a list while iterating through	4	Feb 25, 2007
Iterating over objects of a class	6	Dec 24, 2008
groveling over a file for Q:: and A:: stmts	3	Jul 24, 2012
FAQ 4.57 What happens if I add or remove keys from a hash while iterating over it?	0	Jan 10, 2011

Slow down while creating a big list and iterating over it

marc magrans de abril

Alf P. Steinbach

MRAB

Aliaksandr Abushkevich

marc magrans de abril

marc magrans de abril

MRAB

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads