Memory leak in Python

diffuser78

I have Python code running on a huge data set. After starting the
program, the computer becomes unstable and it gets very difficult even
to open Konsole to kill the process. What I am assuming is that I am
running out of memory.

What should I do to make sure that my code runs fine without becoming
unstable? How should I address the memory leak problem, if any? I have
a gig of RAM.

Any help is appreciated.
 
vbgunz

How big is the set? 100 MB, more? What are you doing with the set? Do
you have a small example that can prove the set is causing the freeze?
I am not the sharpest tool in the shed, but it sounds like you might be
multiplying your set, directly or indirectly, either permanently or
temporarily, on purpose or by accident.
 
Sybren Stuvel

(e-mail address removed) enlightened us with:
I have Python code running on a huge data set. After starting the
program, the computer becomes unstable and it gets very difficult even
to open Konsole to kill the process. What I am assuming is that I am
running out of memory.

Before acting on your assumptions, you need to verify them. Run 'top'
and hit 'M' to sort by memory usage. After that, use 'ulimit' to limit
the allowed memory usage, run your program again, and see if it stops
at some point due to memory problems.
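
If you want the same kind of safety net from inside the script, the
standard resource module can impose a limit much like ulimit does. A
minimal sketch, with an arbitrary 512 MB cap as an example value:

import resource

# Cap the process's total address space (Unix only). Once the limit is
# reached, allocations raise MemoryError instead of dragging the whole
# machine into swap.
limit = 512 * 1024 * 1024   # example value, tune to your machine
resource.setrlimit(resource.RLIMIT_AS, (limit, limit))

That way the program dies with a traceback you can inspect instead of
freezing the desktop.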

Sybren
 
Peter Tillotson

1) Review your design - you say you are processing a large data set;
just make sure you are not trying to store three versions of it. If you
are missing a design, create a flow chart or something that is true to
the code you have produced. You could probably even post the design if
you are brave enough.

2) Check your implementation - make sure you manage lists, arrays etc.
correctly. You need to sever links (references) to objects for them to
get swept up (see the sketch after this list). I know it is obvious, but
it is an easy mistake to make in a hasty implementation.

3) Verify and test the problem's characteristics: profilers, top, etc.
It is hard for us to help you much without more info. Test your
assumptions.
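
A minimal sketch of what severing references means in practice (the
class and attribute names are invented for illustration):

class Node(object):
    def __init__(self, ident):
        self.ident = ident
        self.peers = []        # strong references to other Node objects

nodes = [Node(i) for i in range(2000)]

# An object stays alive as long as *anything* still references it.
done = nodes[0]
done.peers = []    # sever this node's links to other nodes
nodes[0] = None    # sever the container's link to the node
del done           # and drop our own local name as well

Only once every path to the object is cut can the memory be reclaimed.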

Problem solving and debugging is a process, not some mystic art. Though
sometimes the Gremlins disappear after a pint or two :)

p
 
Dennis Lee Bieber

I have Python code running on a huge data set. After starting the
program, the computer becomes unstable and it gets very difficult even
to open Konsole to kill the process. What I am assuming is that I am
running out of memory.

What should I do to make sure that my code runs fine without becoming
unstable? How should I address the memory leak problem, if any? I have
a gig of RAM.
Does the memory come back after the process exits?

You don't show any sample of code or data... Nor do you mention what
OS/processor is involved.

Many systems do not return /allocated/ memory to the OS until the
top-level process exits, even if the memory is "freed" from the
viewpoint of the process.
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 
bruno at modulix

I have Python code running on a huge data set. After starting the
program, the computer becomes unstable and it gets very difficult even
to open Konsole to kill the process. What I am assuming is that I am
running out of memory.

What should I do to make sure that my code runs fine without becoming
unstable? How should I address the memory leak problem, if any? I have
a gig of RAM.

Any help is appreciated.

Just a hint: if you're trying to load your whole "huge data set" into
memory, you're in for trouble whatever the language - for example,
doing a 'buf = openedFile.read()' on a 100 gig file may not be a good
idea...
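
For example, reading in fixed-size chunks keeps memory usage flat no
matter how big the file is. A rough sketch (the file name, chunk size
and process() are placeholders):

def process(chunk):
    pass   # whatever work needs doing on each chunk

with open('huge.dat', 'rb') as f:
    while True:
        chunk = f.read(64 * 1024)   # 64 KB at a time, not the whole file
        if not chunk:
            break
        process(chunk)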
 
diffuser78

I am using Ubuntu Linux.

My program is a simulation program with four classes and it mimics
BitTorrent file sharing systems on 2000 nodes. Now, each node has a lot
of attributes and my program kind of tries to keep tabs on everything.
As I mentioned, it's a simulation program; it starts at time T=0 and
goes on until all nodes have received all parts of the file (the
BitTorrent concept). The ending time goes to thousands of seconds. In
each second I process all 2000 nodes.

Pseudo code:

Time = 0
while (True) {
    For all nodes in the system {
        Process + computation
    }
    Time++
    If (DownloadFinished == True) exit;
}
 
Dennis Lee Bieber

I am using Ubuntu Linux.

My program is a simulation program with four classes and it mimics
BitTorrent file sharing systems on 2000 nodes. Now, each node has a lot
of attributes and my program kind of tries to keep tabs on everything.
As I mentioned, it's a simulation program; it starts at time T=0 and
goes on until all nodes have received all parts of the file (the
BitTorrent concept). The ending time goes to thousands of seconds. In
each second I process all 2000 nodes.
Any chance each of your nodes is creating a whole allocation of the
same "data file" in memory, and those allocations are not being freed
at the end of the "transfer"?
Pseudo code:

Time = 0
while (True) {
    For all nodes in the system {
        Process + computation
    }
    Time++
    If (DownloadFinished == True) exit;
}
<eeek> C-code (or is it Java...) Given how many references refer to
Python as "executable pseudo-code" <G>

time = 0
while not downloadFinished:
    for eachNode in system:
        pass   # process + computation
    time += 1

<G>
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/
 
diffuser78

The amount of data I read in is actually small.

If you see my algorithm above, it deals with 2000 nodes and each node
has a lot of attributes.

When I close the program my computer becomes stable and performs as
usual. I checked the performance in the performance monitor and using
"top": all of the memory is being used, and on top of that around half
a gig of swap is also being used.

Please give some helpful pointers to overcome such memory errors.

I revisited my code and found nothing so obvious that would let this
leak happen. How do I kill cross-references in the program? I am kind
of a newbie and not completely aware of how to fine-tune such a
program.

Thanks
 
Karthik Gurusamy

The amount of data I read in is actually small.

If you see my algorithm above, it deals with 2000 nodes and each node
has a lot of attributes.

When I close the program my computer becomes stable and performs as
usual. I checked the performance in the performance monitor and using
"top": all of the memory is being used, and on top of that around half
a gig of swap is also being used.

Please give some helpful pointers to overcome such memory errors.

I revisited my code and found nothing so obvious that would let this
leak happen. How do I kill cross-references in the program? I am kind
of a newbie and not completely aware of how to fine-tune such a
program.

I suspect you are trying to store each node's attributes in every other
node. Basically you have an O(N^2) algorithm (in space, and probably
worse in time). For N=2000, N^2 is pretty big, so you see memory
issues.

Try not to store O(N^2) information and see if you can scale the memory
requirements linearly in N. That is, see if you can store the
attributes of a node in only one place per node.

I'm just guessing at your implementation, but from what you say
(peer-to-peer), I suspect there is an O(N^2) requirement. Also try
experimenting with a small N (say, 100 nodes).
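
Purely as a sketch of the idea (this is a guess at your data layout,
with invented names): keep one table of attributes keyed by node id,
and let every node refer to its peers by id instead of holding copies.

N = 100   # experiment with a small N first

# O(N) in space: each node's attributes live in exactly one place ...
attrs = dict((i, {'pieces': set(), 'upload_rate': 0}) for i in range(N))

# ... and nodes refer to their peers by id, not by copying their data
peers = dict((i, [j for j in range(N) if j != i]) for i in range(N))

def pieces_of(node_id):
    # always look the data up in the shared table
    return attrs[node_id]['pieces']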

Thanks,
Karthik
 
Sybren Stuvel

(e-mail address removed) enlightened us with:
My program is a simulation program with four classes and it mimics
BitTorrent file sharing systems on 2000 nodes.

Wouldn't it be better to use an existing simulator? That way, you
won't have to do the stuff you don't want to think about, and can focus
on the more interesting parts. There are plenty of discrete-event and
discrete-time simulators to choose from.

Sybren
 
Serge Orlov

I am using Ubuntu Linux.

My program is a simulation program with four classes and it mimics
BitTorrent file sharing systems on 2000 nodes. Now, each node has a lot
of attributes and my program kind of tries to keep tabs on everything.
As I mentioned, it's a simulation program; it starts at time T=0 and
goes on until all nodes have received all parts of the file (the
BitTorrent concept). The ending time goes to thousands of seconds. In
each second I process all 2000 nodes.

Most likely you keep references to objects you don't need, so the
Python garbage collector cannot remove those objects. If you cannot
figure it out by looking at the source code, you can gather some
statistics to help you: for example, use the gc module to iterate over
all objects in your program (gc.get_objects()) and find out which types
of objects are growing with each iteration.
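
Something along these lines (a rough sketch) can be called once per
simulated second to see which object types keep growing:

import gc
from collections import defaultdict

def count_types():
    counts = defaultdict(int)
    for obj in gc.get_objects():
        counts[type(obj).__name__] += 1
    return counts

previous = {}

def report_growth():
    # print every type whose instance count went up since the last call
    global previous
    current = count_types()
    for name in sorted(current):
        if current[name] > previous.get(name, 0):
            print("%s: %d -> %d" % (name, previous.get(name, 0), current[name]))
    previous = current

The types that grow on every call are the ones whose references are
being kept around.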
 
bruno at modulix

The amount of data I read in is actually small.

So the problem is probably elsewhere... Sorry, since you were talking
about a huge dataset, the good old "read-whole-file-in-memory"
antipattern seemed an obvious guess.
If you see my algorithm above, it deals with 2000 nodes and each node
has a lot of attributes.

When I close the program my computer becomes stable and performs as
usual. I checked the performance in the performance monitor and using
"top": all of the memory is being used, and on top of that around half
a gig of swap is also being used.

Please give some helpful pointers to overcome such memory errors.

A real memory leak would cause the memory usage to keep increasing as
long as your program is running. If this is not the case, it's not a
"memory error", but a design/program error. FWIW, apps like Zope can end
up using a whole lot of memory, but there's no known memory-leak problem
AFAIK. And believe me, a Zope app can end up managing a *really huge
lot* of objects (>= many thousands).
I revisited my code and found nothing so obvious that would let this
leak happen. How do I kill cross-references in the program?

Using weakref and/or gc might help.

FWIW, the default memory management in Python is based on reference
counting. As long as anything keeps a reference to an object, that
object stays alive. If you have lots of cross-references and 2000+ big
objects, you may effectively end up eating all the RAM and more. The gc
module can detect and collect some cyclic references (obj A has a ref
to obj B, which has a ref to obj A). The weakref module provides
'proxy' references that let reference counting do its job (I guess the
doc will be much more explicit than me).
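
A tiny illustration of the difference (not taken from your program,
just a sketch):

import weakref

class Node(object):
    pass

a = Node()
b = Node()

# Strong back-references: a and b keep each other alive in a cycle that
# plain reference counting can never free; only gc can reclaim it.
a.peer = b
b.peer = a

# Weak back-reference: b.peer no longer keeps a alive, so as soon as
# the last strong reference to a disappears, a is reclaimed immediately.
b.peer = weakref.proxy(a)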

Another possible improvement could be to use the flyweight design
pattern to share memory for some attributes:

- a general (though somewhat Java-oriented) explanation:
http://www.exciton.cs.rice.edu/JavaResources/DesignPatterns/FlyweightPattern.htm

- two Python examples (the second being based on the first):
http://www.suttoncourtenay.org.uk/duncan/accu/pythonpatterns.html#flyweight
http://push.cx/2006/python-flyweights
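
As a rough sketch of the idea (the attribute is invented for
illustration): equal attribute values are created once and shared
between nodes instead of being duplicated in each one.

class PieceMap(object):
    """Immutable, shareable attribute value (flyweight)."""
    _cache = {}

    def __new__(cls, pieces):
        key = frozenset(pieces)
        obj = cls._cache.get(key)
        if obj is None:
            obj = object.__new__(cls)
            obj.pieces = key
            cls._cache[key] = obj
        return obj

# Two nodes holding the same set of pieces now share one object.
m1 = PieceMap([1, 2, 3])
m2 = PieceMap([3, 2, 1])
assert m1 is m2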

HTH
 
diffuser78

Sure. Are there any available simulators? Since I am modifying some
stuff I thought of creating one of my own, but if you know some
existing simulators, those could be of great help to me.

Thanks
 
diffuser78

I ran the simulation for 128 nodes and used the following:

oo = gc.get_objects()
print len(oo)

On every time step the number of objects increases. For 128 nodes I had
1,058,177 objects.

I think I need to revisit the code and remove the references... but how
do I do that? I am still a newbie coder and any help will be greatly
appreciated.

thanks
 
Sybren Stuvel

(e-mail address removed) enlightened us with:
Sure. Are there any available simulators? Since I am modifying some
stuff I thought of creating one of my own, but if you know some
existing simulators, those could be of great help to me.

Don't know any by name, but I'm sure you can find some on Google. Do
you need a discrete-event or a discrete-time simulator?

Sybren
 
