BIG memory problem

tobyclemson

Hi all,

I'm having a really odd memory problem with a small ruby program I've
written. It basically takes in lines from input files (which represent
router flows), deduplicates them (based on elements of the line) and
outputs the unique flows to file. The input file often contains over
300,000 lines of which about 25-30% are duplicates. The trouble I'm
having is that the program (which is intended to be long running) does
not seem to release any memory back to the system and in fact just
increases in memory footprint from iteration to iteration. It should
use about 150 MB by my estimates but sails through this and yesterday
slowed to a halt at about 1.6GB (due to the GC by my guess). This
doesn't make any sense to me as at times I am deleting data structures
that occupy at least 50MB of memory.

The codebase is slightly too big to pastie but it is available
here http://svn.tobyclemson.co.uk/public/trunk/flow_deduplicator .
There are actually only 2 classes of importance and 1 script but I
don't know if pastie can handle that.

Any help would be greatly appreciated as the alternative (pressures
from above) is to rewrite in Python (which involves me learning
Python).

Thanks in advance,
Toby
 
Stefan Lang

2008/8/8 [email protected] said:
Hi all,

I'm having a really odd memory problem with a small ruby program I've
written. It basically takes in lines from input files (which represent
router flows), deduplicates them (based on elements of the line) and
outputs the unique flows to file. The input file often contains over
300,000 lines of which about 25-30% are duplicates. The trouble I'm
having is that the program (which is intended to be long running) does
not seem to release any memory back to the system and in fact just
increases in memory footprint from iteration to iteration. It should
use about 150 MB by my estimates but sails through this and yesterday
slowed to a halt at about 1.6GB (due to the GC by my guess). This
doesn't make any sense to me as at times I am deleting data structures
that occupy at least 50MB of memory.

The codebase is slightly too big to pastie but it is available
here http://svn.tobyclemson.co.uk/public/trunk/flow_deduplicator .
There are actually only 2 classes of importance and 1 script but I
don't know if pastie can handle that.

Any help would be greatly appreciated as the alternative (pressures
from above) is to rewrite in Python (which involves me learning
Python)

I _think_ I have found a problem. In the main loop (in bin/dedupe),
you use a single Timestamp instance, which is destructively
modified by calling advance.

Now this single Timestamp instance is used as a key for _all_
calls to checksum_buffer.add(). As a result, the @buffers hash
will always have only one entry and this single entry will hold _all_
flow.checksum/flow.timestamp pairs ever. Since the retention threshold
is 1, this single @buffers entry that holds _all_ data will never be
deleted.
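
To illustrate the effect, here is a minimal sketch. MutableTimestamp and
the buffers hash below are hypothetical stand-ins, not the code from the
repository. The sketch assumes the class does not override #hash or
#eql?, so Hash identifies the key by object identity and every add lands
in the same bucket no matter how often the time changes.

```ruby
# Hypothetical stand-in for the real Timestamp class (assumption: it does
# not override #hash or #eql?, so Hash compares keys by object identity).
class MutableTimestamp
  attr_reader :time

  def initialize(time)
    @time = time
  end

  # Destructive: changes this very object and returns it.
  def advance
    @time += 60
    self
  end
end

# Stand-in for @buffers: each key maps to a list of checksums.
buffers = Hash.new { |hash, key| hash[key] = [] }
timestamp = MutableTimestamp.new(0)

3.times do |minute|
  buffers[timestamp] << "checksum-#{minute}" # always the same key object
  timestamp.advance
end

puts buffers.size              # number of distinct keys
puts buffers[timestamp].length # entries piled up under that one key
```

With a single key object the hash never grows past one entry, and that
entry collects everything.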

The solution should be to make Timestamp#advance nondestructive
and change the line

    timestamp.advance

in the main loop to

    timestamp = timestamp.advance

Stefan
 
tobyclemson

Sorry I don't quite understand the problem - I can see that it
probably is one but I think it's a matter of terminology. What do you
mean when you say destructively modified? I am modifying the value of
the timestamp in place? So that any reference to that timestamp will
be modified too? Should I be doing a duplication on the string that is
used to key the buffer in the buffers hash? I didn't think that the
actual object was passed in when an argument is supplied, I thought a
copy of it was passed in..

How would I make Timestamp#advance nondestructive?
If it is easier than pasting here I can give you commit privileges on
that repository?

Thanks very much for your help,
Toby
 
Stefan Lang

2008/8/8 [email protected] said:
Sorry I don't quite understand the problem - I can see that it
probably is one but I think it's a matter of terminology. What do you
mean when you say destructively modified? I am modifying the value of
the timestamp in place? So that any reference to that timestamp will
be modified too? Should I be doing a duplication on the string that is
used to key the buffer in the buffers hash? I didn't think that the
actual object was passed in when an argument is supplied, I thought a
copy of it was passed in..

How would I make Timestamp#advance nondestructive?
If it is easier than pasting here I can give you commit privileges on
that repository?


Arguments are passed by reference. Not a reference to the variable,
but a reference to the object. That's how most OO languages work.
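
A small sketch of what that means in practice. The method names mutate
and rebind are made up for illustration: changing the object inside a
method is visible to the caller, while assigning to the parameter only
rebinds the local name.

```ruby
# The method receives a reference to the same String object the caller holds.
def mutate(text)
  text << " (changed)" # modifies the shared object in place
end

# Assigning to the parameter only rebinds the local name; the caller's
# variable still refers to the original object.
def rebind(text)
  text = "something else"
  text
end

original = String.new("flow") # an unfrozen String we are free to modify
mutate(original)
puts original # the in-place change is visible to the caller

rebound = rebind(original)
puts original # unchanged by the assignment inside rebind
puts rebound  # the new object exists only as the return value
```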

Regarding your program: Add an accessor for the :time to the
Timestamp class, then change the advance definition
to this:

    def advance
      ts = self.dup
      ts.time += 60
      ts
    end

Instead of modifying the instance, we create a new one with the
desired change.

Now in the main loop in dedupe change this line:

timestamp.advance

to

timestamp = timestamp.advance

This way ChecksumBuffer#add will actually get a different
timestamp object on each call.

Since you also use Enumerable#min on an array of Timestamp
objects, you need to add Timestamp#<=>:

    def <=>(other)
      self.time <=> other.time
    end

That should do it.
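
Put together, the pieces might look roughly like the sketch below. It is
not the class from the repository: the constructor and the assumption
that :time is an Integer number of seconds are mine, and the real
Timestamp may hold more state.

```ruby
# Sketch only; the constructor and the Integer :time are assumptions.
class Timestamp
  attr_accessor :time

  def initialize(time)
    @time = time
  end

  # Nondestructive: returns a new Timestamp one minute later and leaves
  # the receiver untouched.
  def advance
    ts = dup
    ts.time += 60
    ts
  end

  # Needed so Enumerable#min can order Timestamp objects.
  def <=>(other)
    time <=> other.time
  end
end

first  = Timestamp.new(0)
second = first.advance

puts first.time               # the receiver is unchanged
puts second.time              # the copy is one minute later
puts [second, first].min.time # min relies on <=>

# Each advance yields a distinct object, so each acts as its own hash key.
buffers = { first => [], second => [] }
puts buffers.size
```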
Thanks very much for your help,

You're welcome!

Stefan
 
tobyclemson

Ok I've gone and had a little play and yes the memory problem was
completely my fault. I was passing in the timestamp to use as the key
for the buffer rather than the current value of the timestamp. By
changing the line checksum_buffer.add(flow, timestamp) to
checksum_buffer.add(flow, timestamp.current) the problems are solved!
It's just a shame it took me nearly a day of debugging and attempting
to learn Python and help from you guys to work that out!
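
For anyone reading later, a rough sketch of why keying by the value
works. This is not the real code: what Timestamp#current returns is an
assumption here (a String built from the time), as are the constructor
and the buffers hash. The point is only that a fresh value per minute
gives one bucket per minute, even though advance still modifies the
timestamp in place.

```ruby
# Hypothetical Timestamp: `current` is assumed to return a fresh value
# (here a String) describing the time at the moment of the call.
class Timestamp
  def initialize(time)
    @time = time
  end

  def current
    @time.to_s
  end

  # Still destructive, as in the original program.
  def advance
    @time += 60
    self
  end
end

buffers = Hash.new { |hash, key| hash[key] = [] }
timestamp = Timestamp.new(0)

3.times do |minute|
  buffers[timestamp.current] << "checksum-#{minute}" # key is the value
  timestamp.advance
end

puts buffers.keys.inspect # one key per minute
buffers.delete("0")       # an old bucket can now be dropped on its own
puts buffers.size
```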

Stefan, Edward, Thanks again for your help. I would never have noticed
that bug without your help Stefan,
Thanks,
Toby
 
