Please help with MemoryError


Jeremy

I have been using Python for several years now and have never run into
memory errors…

until now.

My Python program now consumes over 2 GB of memory and then I get a
MemoryError. I know I am reading lots of files into memory, but not
2GB worth. I thought I didn't have to worry about memory allocation
in Python because of the garbage collector. On this note I have a few
questions. FYI I am using Python 2.6.4 on my Mac.

1. When I pass a variable to the constructor of a class does it
copy that variable or is it just a reference/pointer? I was under the
impression that it was just a pointer to the data.
2. When do I need to manually allocate/deallocate memory and when
can I trust Python to take care of it?
3. Any good practice suggestions?

Thanks,
Jeremy
 

Alf P. Steinbach

* Jeremy:
I have been using Python for several years now and have never run into
memory errors…

until now.

My Python program now consumes over 2 GB of memory and then I get a
MemoryError. I know I am reading lots of files into memory, but not
2GB worth. I thought I didn't have to worry about memory allocation
in Python because of the garbage collector. On this note I have a few
questions. FYI I am using Python 2.6.4 on my Mac.

1. When I pass a variable to the constructor of a class does it
copy that variable or is it just a reference/pointer? I was under the
impression that it was just a pointer to the data.

Uhm, I discovered that "pointer" is apparently a Loaded Word in the Python
community. At least in some sub-community, so, best avoided. But essentially
you're just passing a reference to an object. The object is not copied.
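
A minimal sketch of that point, assuming a hypothetical `Holder` class that simply stores its argument. The names are illustrative only and are not from Jeremy's program.

```python
class Holder(object):
    """Hypothetical class that stores whatever it is given."""
    def __init__(self, data):
        self.data = data  # binds another name to the same object; no copy is made

big_list = list(range(1000))
holder = Holder(big_list)

same_object = holder.data is big_list   # both names refer to one list object
big_list.append(-1)                     # mutate through the original name
seen_through_holder = holder.data[-1]   # the change is visible through holder
```

If the constructor had stored `list(data)` or `data[:]` instead, a second list would exist and memory use would double for that data.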

2. When do I need to manually allocate/deallocate memory and when
can I trust Python to take care of it?

Python takes care of deallocation of objects that are /no longer referenced/.

3. Any good practice suggestions?

You need to get rid of references to objects before Python will garbage collect
them.

Typically, in a language like Python (or Java, C#...) memory leaks are caused by
keeping object references in singletons or globals, e.g. for purposes of event
notifications. For example, you may have some dictionary somewhere.

Such references from singletons/globals need to be removed.

You do not, however, need to be concerned about circular references, at least
unless you need some immediate deallocation.

For although circular references will prevent the objects involved from being
immediately deallocated, the general garbage collector will take care of them later.
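
A small sketch of that behaviour, assuming CPython and a hypothetical `Node` class. The automatic collector is switched off for the duration so the timing is predictable; a weak reference is used only to observe whether the object still exists.

```python
import gc
import weakref

class Node(object):
    """Hypothetical class; its instances can be weakly referenced."""
    def __init__(self):
        self.other = None

gc.disable()              # keep the automatic collector out of the demo
a = Node()
b = Node()
a.other = b
b.other = a               # a and b now form a reference cycle
probe = weakref.ref(a)    # observes a without keeping it alive

del a, b                  # no names remain, but the cycle keeps both alive
alive_before = probe() is not None

gc.collect()              # the cycle collector finds and frees the pair
alive_after = probe() is not None
gc.enable()
```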



Cheers & hth.,

- Alf
 

Jonathan Gardner

I have been using Python for several years now and have never run into
memory errors…

until now.

Yes, Python does a good job of making memory errors the least of your
worries as a programmer. Maybe it's doing too good of a job...

My Python program now consumes over 2 GB of memory and then I get a
MemoryError. I know I am reading lots of files into memory, but not
2GB worth.

Do a quick calculation: How much are you leaving around after you read in
a file? Do you create an object for each line? What does that object
have associated with it? You may find that you have some strange
O(N^2) behavior regarding memory here. Oftentimes people forget that
you have to evaluate how your algorithm will run in time *and* memory.

I thought I didn't have to worry about memory allocation
in Python because of the garbage collector.

While it's not the #1 concern, you still have to keep track of how you
are using memory and try not to be wasteful. Use good algorithms, let
things fall out of scope, etc...

1. When I pass a variable to the constructor of a class does it
copy that variable or is it just a reference/pointer? I was under the
impression that it was just a pointer to the data.

For objects, until you make a copy, there is no copy made. That's the
general rule and even though it isn't always correct, it is correct
enough.

2. When do I need to manually allocate/deallocate memory and when
can I trust Python to take care of it?

Let things fall out of scope. If you're concerned, use del. Try to
avoid using the global namespace for everything, and try to keep your
lists and dicts small.

3. Any good practice suggestions?

Don't read in the entire file and then process it. Try to do
line-by-line processing.
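
A minimal sketch of line-by-line processing. The helper `count_matching_lines` and the throwaway sample file are assumptions made for illustration; the point is only that iterating over the file object keeps one line in memory at a time.

```python
import os
import tempfile

def count_matching_lines(path, needle):
    """Count lines containing needle without loading the whole file."""
    count = 0
    with open(path) as handle:
        for line in handle:      # the file object yields one line at a time
            if needle in line:
                count += 1
    return count

# Build a small throwaway file so the sketch is self-contained.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    for i in range(1000):
        out.write("record %d\n" % i)

matches = count_matching_lines(sample_path, "99")
os.remove(sample_path)
```

By contrast, `handle.read()` or `handle.readlines()` would pull the entire file into memory before any processing starts.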

Figure out what your algorithm is doing in terms of time *and* memory.
You likely have some O(N^2) or worse in memory usage.

Don't use Python variables to store data long-term. Instead, setup a
database or a file and use that. I'd first look at using a file, then
using SQLite, and then a full-fledged database like PostgreSQL.
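
A rough sketch of the SQLite step using the standard `sqlite3` module. The table name, columns, and generated rows are made up for illustration, and `":memory:"` is used only to keep the example self-contained; a real program would pass a file path so the data lives on disk.

```python
import sqlite3

# ":memory:" keeps the sketch self-contained; use a file path for real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (name TEXT, value REAL)")

def rows():
    """Generate rows lazily so they are never all held in a Python list."""
    for i in range(10000):
        yield ("item%d" % i, i * 0.5)

conn.executemany("INSERT INTO records VALUES (?, ?)", rows())
conn.commit()

# Let the database do the aggregation; only the result comes back to Python.
total, = conn.execute("SELECT SUM(value) FROM records").fetchone()
conn.close()
```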

Don't write processes that sit around for a long time unless you also
evaluate whether that process grows in size as it runs. If it does,
you need to figure out why and stop that memory leak.

Simpler code uses less memory. Not just because it is smaller, but
because you are not copying and moving data all over the place. See
what you can do to simplify your code. Maybe you'll expose the nasty
O(N^2) behavior.
 

Steven D'Aprano

My Python program now consumes over 2 GB of memory and then I get a
MemoryError. I know I am reading lots of files into memory, but not 2GB
worth.

Are you sure?

Keep in mind that Python has a comparatively high overhead due to its
object-oriented nature. If you have a list of characters:

['a', 'b', 'c', 'd']

there is the (small) overhead of the list structure itself, but each
individual character is not a single byte, but a relatively large object:

>>> import sys
>>> sys.getsizeof('a')
32

So if you read (say) a 500MB file into a single giant string, you will
have 500MB plus the overhead of a single string object (which is
negligible). But if you read it into a list of 500 million single
characters, you will have the overhead of a single list, plus 500 million
strings, and that's *not* negligible: 32 bytes each instead of 1.

So try to avoid breaking a single huge string into vast numbers of tiny
strings all at once.
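
A small sketch for measuring this yourself with `sys.getsizeof`. Exact numbers depend on the Python version and platform, so none are quoted here. One caveat, stated as an assumption about CPython: it may share identical one-character strings, so the per-element figure below is an upper bound. The list's own pointer array is already several times larger than the original string.

```python
import sys

text = "x" * 100000                 # one large string
chars = list(text)                  # the same data as 100,000 tiny strings

size_as_string = sys.getsizeof(text)
# getsizeof on a list counts the list structure and its element pointers only.
size_of_list_alone = sys.getsizeof(chars)
# Upper bound if every element were a distinct string object.
size_if_all_distinct = size_of_list_alone + len(chars) * sys.getsizeof("x")
```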


I thought I didn't have to worry about memory allocation in
Python because of the garbage collector.

You don't have to worry about explicitly allocating memory, and you
almost never have to worry about explicitly freeing memory (unless you
are making objects that, directly or indirectly, contain themselves --
see below); but unless you have an infinite amount of RAM available of
course you can run out of memory if you use it all up :)

On this note I have a few
questions. FYI I am using Python 2.6.4 on my Mac.

1. When I pass a variable to the constructor of a class does it copy
that variable or is it just a reference/pointer? I was under the
impression that it was just a pointer to the data.

Python's calling model is the same whether you pass to a class
constructor or any other function or method:

x = ["some", "data"]
obj = f(x)

The function f (which might be a class constructor) sees the exact same
list as you assigned to x -- the list is not copied first. However,
there's no promise made about what f does with that list -- it might copy
the list, or make one or more additional lists:

def f(a_list):
    another_copy = a_list[:]
    another_list = map(int, a_list)

2. When do I need
to manually allocate/deallocate memory and when can I trust Python to
take care of it?

You never need to manually allocate memory.

You *may* need to deallocate memory if you make "reference loops", where
one object refers to itself:

l = [] # make an empty list
l.append(l) # add the list l to itself

Python can break such simple reference loops itself, but for more
complicated ones, you may need to break them yourself:

a = []
b = {2: a}
c = (None, b)
d = [1, 'z', c]
a.append(d) # a reference loop

Python will deallocate objects when they are no longer in use. They are
always considered in use any time you have them assigned to a name, or in
a list or dict or other structure which is in use.

You can explicitly remove a name with the del command. For example:

x = ['my', 'data']
del x

After deleting the name x, the list object itself is no longer in use
anywhere and Python will deallocate it. But consider:

x = ['my', 'data']
y = x # y now refers to THE SAME list object
del x

Although you have deleted the name x, the list object is still bound to
the name y, and so Python will *not* deallocate the list.

Likewise:

x = ['my', 'data']
y = [None, 1, x, 'hello world']
del x

Although now the list isn't bound to a name, it is inside another list,
and so Python will not deallocate it.
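
The last case can be checked with a weak reference, which observes an object without keeping it alive. This is a sketch that assumes CPython's immediate reference counting; `Data` is a hypothetical list subclass, needed only because plain lists cannot be weakly referenced.

```python
import weakref

class Data(list):
    """Hypothetical list subclass; plain lists cannot be weakly referenced."""

x = Data(['my', 'data'])
probe = weakref.ref(x)                 # observes the object, holds no claim on it
y = [None, 1, x, 'hello world']

del x
still_alive = probe() is not None      # the outer list y still holds it

del y                                  # the last reference is gone
freed = probe() is None                # CPython frees the object right away
```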


3. Any good practice suggestions?

Write small functions. Any temporary objects created by the function will
be automatically deallocated when the function returns.

Avoid global variables. They are a good way to inadvertently end up with
multiple long-lasting copies of data.

Try to keep data in one big piece rather than lots of little pieces.

But contradicting the above, if the one big piece is too big, it will be
hard for the operating system to swap it in and out of virtual memory,
causing thrashing, which is *really* slow. So aim for big, but not huge.

(By "big" I mean megabyte-sized; by "huge" I mean hundreds of megabytes.)

If possible, avoid reading the entire file in at once, and instead
process it line-by-line.


Hope this helps,
 

Tim Chase

Jonathan said:
Don't use Python variables to store data long-term. Instead, setup a
database or a file and use that. I'd first look at using a file, then
using SQLite, and then a full-fledged database like PostgreSQL.

Just to add to the mix, I'd put the "anydbm" module on the
gradient between "using a file" and "using sqlite". It's a nice
intermediate step between rolling your own file formats for data
on disk, and having to write SQL since access is entirely like
you'd do with a regular Python dictionary.
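
A minimal sketch of that dictionary-style access. It is written for Python 3, where the module formerly called `anydbm` is named `dbm`; the file name and the stored keys are placeholders.

```python
import dbm          # Python 3 name for what Python 2 called anydbm
import os
import shutil
import tempfile

tmpdir = tempfile.mkdtemp()
db_name = os.path.join(tmpdir, "demo_db")   # hypothetical file name

db = dbm.open(db_name, "c")       # "c": create the database if it is missing
db[b"jeremy"] = b"first"          # keys and values are stored as bytes
db[b"alf"] = b"second"
db.close()

db = dbm.open(db_name, "r")       # reopen read-only; the data now lives on disk
value = db[b"jeremy"]
has_alf = b"alf" in db
db.close()
shutil.rmtree(tmpdir)             # clean up the throwaway files
```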

-tkc
 

Aahz

Just to add to the mix, I'd put the "anydbm" module on the gradient
between "using a file" and "using sqlite". It's a nice intermediate
step between rolling your own file formats for data on disk, and having
to write SQL since access is entirely like you'd do with a regular
Python dictionary.

Not quite. One critical difference between dbm and dicts is the need to
remember to "save" changes by setting the key's value again.
 

Tim Chase

Aahz said:
Not quite. One critical difference between dbm and dicts is the need to
remember to "save" changes by setting the key's value again.

Could you give an example of this? I'm not sure I understand
what you're saying. I've used anydbm a bunch of times and other
than wrapping access in

d = anydbm.open(DB_NAME, "c")
# use d as a dict here
d.close()

I've never hit any "need to remember to save changes by
setting the key's value again". The only gotcha I've hit is the
anydbm requirement that all keys/values be strings. Slightly
annoying at times, but strings are my most frequent use case.

-tkc
 

Aahz

Could you give an example of this? I'm not sure I understand
what you're saying. I've used anydbm a bunch of times and other
than wrapping access in

d = anydbm.open(DB_NAME, "c")
# use d as a dict here
d.close()

I've never hit any "need to remember to save changes by
setting the key's value again". The only gotcha I've hit is the
anydbm requirement that all keys/values be strings. Slightly
annoying at times, but strings are my most frequent use case.

Well, you're more likely to hit this by wrapping dbm with shelve (because
it's a little more obvious when you're using pickle directly), but
consider this:

d = anydbm.open(DB_NAME, "c")
x = MyClass()
d['foo'] = x
x.bar = 123

Your dbm does NOT have the change to x.bar recorded, you must do this
again:

d['foo'] = x

With a dict, you have Python's reference semantics.
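
A sketch of the same gotcha using `shelve`, which pickles values on top of dbm. It assumes the default `writeback=False`; the shelf path and the stored dictionary are placeholders, not anyone's real data.

```python
import os
import shelve
import shutil
import tempfile

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "demo_shelf")   # throwaway location

shelf = shelve.open(path)            # writeback defaults to False
shelf['foo'] = {'bar': 0}            # a pickled copy is written to disk

shelf['foo']['bar'] = 123            # mutates a temporary unpickled copy only
lost = shelf['foo']['bar']           # the change never reached the shelf

record = shelf['foo']
record['bar'] = 123
shelf['foo'] = record                # assign again to store the change
saved = shelf['foo']['bar']
shelf.close()
shutil.rmtree(tmpdir)                # clean up the throwaway files
```

Opening the shelf with `writeback=True` avoids the reassignment at the cost of caching every accessed entry in memory, which is probably not what a memory-constrained program wants.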
 

Jeremy

My Python program now consumes over 2 GB of memory and then I get a
MemoryError.  I know I am reading lots of files into memory, but not 2GB
worth.

Are you sure?

Keep in mind that Python has a comparatively high overhead due to its
object-oriented nature. If you have a list of characters:

['a', 'b', 'c', 'd']

there is the (small) overhead of the list structure itself, but each
individual character is not a single byte, but a relatively large object:

 >>> sys.getsizeof('a')
32

So if you read (say) a 500MB file into a single giant string, you will
have 500MB plus the overhead of a single string object (which is
negligible). But if you read it into a list of 500 million single
characters, you will have the overhead of a single list, plus 500 million
strings, and that's *not* negligible: 32 bytes each instead of 1.

So try to avoid breaking a single huge string into vast numbers of tiny
strings all at once.

I thought I didn't have to worry about memory allocation in
Python because of the garbage collector.

You don't have to worry about explicitly allocating memory, and you
almost never have to worry about explicitly freeing memory (unless you
are making objects that, directly or indirectly, contain themselves --
see below); but unless you have an infinite amount of RAM available of
course you can run out of memory if you use it all up :)

On this note I have a few
questions. FYI I am using Python 2.6.4 on my Mac.

1. When I pass a variable to the constructor of a class does it copy
that variable or is it just a reference/pointer? I was under the
impression that it was just a pointer to the data.

Python's calling model is the same whether you pass to a class
constructor or any other function or method:

x = ["some", "data"]
obj = f(x)

The function f (which might be a class constructor) sees the exact same
list as you assigned to x -- the list is not copied first. However,
there's no promise made about what f does with that list -- it might copy
the list, or make one or more additional lists:

def f(a_list):
    another_copy = a_list[:]
    another_list = map(int, a_list)

2. When do I need to manually allocate/deallocate memory and when
can I trust Python to take care of it?

You never need to manually allocate memory.

You *may* need to deallocate memory if you make "reference loops", where
one object refers to itself:

l = []  # make an empty list
l.append(l)  # add the list l to itself

Python can break such simple reference loops itself, but for more
complicated ones, you may need to break them yourself:

a = []
b = {2: a}
c = (None, b)
d = [1, 'z', c]
a.append(d)  # a reference loop

Python will deallocate objects when they are no longer in use. They are
always considered in use any time you have them assigned to a name, or in
a list or dict or other structure which is in use.

You can explicitly remove a name with the del command. For example:

x = ['my', 'data']
del x

After deleting the name x, the list object itself is no longer in use
anywhere and Python will deallocate it. But consider:

x = ['my', 'data']
y = x  # y now refers to THE SAME list object
del x

Although you have deleted the name x, the list object is still bound to
the name y, and so Python will *not* deallocate the list.

Likewise:

x = ['my', 'data']
y = [None, 1, x, 'hello world']
del x

Although now the list isn't bound to a name, it is inside another list,
and so Python will not deallocate it.

3. Any good practice suggestions?

Write small functions. Any temporary objects created by the function will
be automatically deallocated when the function returns.

Avoid global variables. They are a good way to inadvertently end up with
multiple long-lasting copies of data.

Try to keep data in one big piece rather than lots of little pieces.

But contradicting the above, if the one big piece is too big, it will be
hard for the operating system to swap it in and out of virtual memory,
causing thrashing, which is *really* slow. So aim for big, but not huge.

(By "big" I mean megabyte-sized; by "huge" I mean hundreds of megabytes.)

If possible, avoid reading the entire file in at once, and instead
process it line-by-line.

Hope this helps,

Wow, what a great bunch of responses. Thank you very much. If I
understand correctly the suggestions seem to be:
1. Write algorithms to read a file one line at a time instead of
reading the whole thing
2. Use lots of little functions so that memory can fall out of
scope.

You also confirmed what I thought was true that all variables are
passed "by reference" so I don't need to worry about the data being
copied (unless I do that explicitly).

Thanks!
Jeremy
 

Tim Chase

Aahz said:
Could you give an example of this? I'm not sure I
understand what you're saying.

Well, you're more likely to hit this by wrapping dbm with shelve (because
it's a little more obvious when you're using pickle directly), but
consider this:

d = anydbm.open(DB_NAME, "c")
x = MyClass()
d['foo'] = x
x.bar = 123

Your dbm does NOT have the change to x.bar recorded, you must do this
again:

d['foo'] = x

With a dict, you have Python's reference semantics.

Ah, that makes sense...fallout of the "dbm only does string
keys/values". I try to adhere to the "only use strings", so I'm
more cognizant of when I marshal complex data-types in or out of
those strings. But I can see where it could bite a person.

Thanks,

-tkc
 

Steven D'Aprano

You also confirmed what I thought was true that all variables are passed
"by reference" so I don't need to worry about the data being copied
(unless I do that explicitly).

No, but yes.

No, variables are not passed by reference, but yes, you don't have to
worry about them being copied.

You have probably been misled into thinking that there are only two
calling conventions possible, "pass by value" and "pass by reference".
That is incorrect. There are many different calling conventions, and
different groups use the same names to mean radically different things.

If a language passes variables by reference, you can write a "swap"
function like this:

def swap(a, b):
    a, b = b, a

x = 1
y = 2
swap(x, y)
assert (x == 2) and (y==1)

But this does not work in Python, and cannot work without trickery. So
Python absolutely is not "pass by reference".

On the other hand, if a variable is passed by value, then a copy is made
and you can do this:

def append1(alist):
    alist.append(1)  # modify the copy
    return alist

x = []
newlist = append1(x)
assert x == [] # The old value still exists.

But this also doesn't work in Python! So Python isn't "pass by value"
either.

What Python does is called "pass by sharing", or sometimes "pass by
object reference". It is exactly the same as what (e.g.) Ruby and Java
do, except that confusingly the Ruby people call it "pass by reference"
and the Java people call it "pass by value", thus guaranteeing the
maximum amount of confusion possible.
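
A runnable sketch of what "pass by sharing" means in practice. The function names `rebind` and `mutate` are illustrative, not standard terminology.

```python
def rebind(items):
    items = [99]          # rebinds the local name only; the caller is unaffected

def mutate(items):
    items.append(1)       # changes the shared object; the caller sees it

x = []
rebind(x)
after_rebind = list(x)    # snapshot taken after rebind(): still empty

mutate(x)
after_mutate = list(x)    # snapshot taken after mutate(): now contains 1
```

Assignment inside a function never reaches the caller's names, while mutation of the passed object always does, because no copy was made.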


More here:
http://effbot.org/zone/call-by-object.htm
http://en.wikipedia.org/wiki/Evaluation_strategy
 

John Posner

You also confirmed what I thought was true that all variables are passed
"by reference" so I don't need to worry about the data being copied
(unless I do that explicitly).

No, but yes.

No, variables are not passed by reference, but yes, you don't have to
worry about them being copied.

You have probably been misled into thinking that there are only two
calling conventions possible, "pass by value" and "pass by reference".
That is incorrect. There are many different calling conventions, and
different groups use the same names to mean radically different things.

If a language passes variables by reference, you can write a "swap"
function like this:

def swap(a, b):
    a, b = b, a

x = 1
y = 2
swap(x, y)
assert (x == 2) and (y==1)

But this does not work in Python, and cannot work without trickery. So
Python absolutely is not "pass by reference".

On the other hand, if a variable is passed by value, then a copy is made
and you can do this:

def append1(alist):
    alist.append(1)  # modify the copy
    return alist

x = []
newlist = append1(x)
assert x == [] # The old value still exists.

But this also doesn't work in Python! So Python isn't "pass by value"
either.

What Python does is called "pass by sharing", or sometimes "pass by
object reference". It is exactly the same as what (e.g.) Ruby and Java
do, except that confusingly the Ruby people call it "pass by reference"
and the Java people call it "pass by value", thus guaranteeing the
maximum amount of confusion possible.


More here:
http://effbot.org/zone/call-by-object.htm
http://en.wikipedia.org/wiki/Evaluation_strategy

Excellent writeup, Steve! You and Jeremy might be interested in a
message on "pass-by-XXX" from John Zelle, the author of the textbook
"Python Programming: An Introduction to Computer Science". [1] This was
part of a long thread in May 2008 on the Python edu-sig list -- nearly
as long as "Modifying Class Object", but with nowhere near the fireworks!

Tx,
John

[1] http://mail.python.org/pipermail/edu-sig/2008-May/008583.html
 

mk


Hmm how about "call by label-value"?

That is, you change labels by assignment, but pass the value of the
label to a function. Since label value is passed, original label is not
changed (i.e. it's not call by reference).

However, an object referenced by label value can be changed in Python,
like in classic example of list label passed to a function and then this
list being modified in a function.

Regards,
mk
 

Ethan Furman

mk said:

Hmm how about "call by label-value"?

That is, you change labels by assignment, but pass the value of the
label to a function. Since label value is passed, original label is not
changed (i.e. it's not call by reference).

However, an object referenced by label value can be changed in Python,
like in classic example of list label passed to a function and then this
list being modified in a function.

Regards,
mk

Because "value" and "reference" are already well defined terms with very
definite meanings, I think using them in any way to describe Python's
model will lead to confusion.

Seems to me that "call by object", a term coined decades ago, and that
accurately defines the way that Python (the language) actually does it,
should be the term used.

My $0.02.

~Ethan~
 

Alf P. Steinbach

* Christian Heimes:
mk said:
Hmm how about "call by label-value"?

Or "call by guido"? How do you like "call like a dutch"? :]

Just a note: it might be more clear to talk about "pass by XXX" than "call by XXX".

Unless you're talking about something else than argument passing.

The standard terminology is in my view fine.


Cheers & hth.,

- Alf
 

Antoine Pitrou

On Fri, 12 Feb 2010 17:14:57 +0000, Steven D'Aprano wrote:
What Python does is called "pass by sharing", or sometimes "pass by
object reference". It is exactly the same as what (e.g.) Ruby and Java
do, except that confusingly the Ruby people call it "pass by reference"
and the Java people call it "pass by value", thus guaranteeing the
maximum amount of confusion possible.

Well, I think the Ruby people got it right. Python *does* pass parameters
by reference. After all, we even speak about reference counting,
reference cycles, etc.

So I'm not sure what distinction you're making here.
 

Steve Holden

Antoine said:
On Fri, 12 Feb 2010 17:14:57 +0000, Steven D'Aprano wrote:

Well, I think the Ruby people got it right. Python *does* pass parameters
by reference. After all, we even speak about reference counting,
reference cycles, etc.

So I'm not sure what distinction you're making here.

He's distinguishing what Python does from the "call by reference" which
has been used since the days of Algol 60.

As has already been pointed out, if Python used call by reference then
the following code would run without raising an AssertionError:

def exchange(a, b):
    a, b = b, a

x = 1
y = 2
exchange(x, y)
assert (x == 2 and y == 1)

Since function-local assignment always takes place in the function
call's local namespace Python does not, and cannot, work like this, and
hence the term "call by reference" is inapplicable to Python's semantics.
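
For contrast, a sketch of one common way to get the swapped result in Python: return the new values and let the caller rebind its own names. This variant of `exchange` is an illustration, not something proposed in the thread.

```python
def exchange(a, b):
    return b, a        # hand the swapped pair back to the caller

x = 1
y = 2
x, y = exchange(x, y)  # the caller rebinds its own names
```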

regards
Steve
 
