Reclaiming (lots of) memory

Thomas Rast

Hello everyone

My scenario is somewhat involved, so if you're in a hurry, here's an
abstract: I have a daemon that needs about 80MB of RAM to build its
internal data structures, but can pack them to 20MB for the rest of
its lifetime. The other 60MB are never returned to the operating
system. I've thought of building and packing the structures in a
fork()ed process, then piping them over to the main part; is there an
easier way to get my RAM back?

Explanation:

The program is part of a toy IRC bot which reads roughly 100'000 lines
from logfiles and builds a huge dict of dicts which is then used to
build sentences much like word-based dissociated press would. All
entries are ints representing words or, as values of the inner
dictionaries, frequencies.
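
For illustration, the table has roughly the shape sketched below (the
word IDs and counts are made up, and the real code does more): keys
are tuples of word IDs, the inner dicts map each possible next word to
how often it followed that context, and sentence building boils down
to a frequency-weighted pick from the inner dict.

# Illustrative sketch only -- the word IDs and counts are invented.
import random

table = {
    (12, 7, 93): {41: 3, 8: 1},   # after 12,7,93: word 41 seen 3x, word 8 once
}

def next_word(context):
    """Pick a successor for a context, weighted by observed frequency."""
    freqs = table[context]
    pick = random.randrange(sum(freqs.itervalues()))
    for word, count in freqs.iteritems():
        if pick < count:
            return word
        pick -= count

print next_word((12, 7, 93))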

The final dictionary has about 325'000 entries, and the python process
uses around 80MB of RAM. See the test case at the end of my post:
fill() creates a similar data structure. If you run it as a script
and compare the 'ps' outputs, you'll see that 'del d' frees only a
small part of the memory to the OS; on my Linux system, about 10%.

Using the standard 'struct' module, I can pack the keys and values of
the outer dictionary into strings, which brings memory usage down to
about 20MB. Unfortunately, this has to be done as a second step (as
opposed to always keeping the dictionary in that format); otherwise it
would slow things down too much. Writing everything to disk (using
e.g. 'bsddb3') suffers from the same problem; I really have to do the
initial work in RAM to get acceptable speed.
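
For illustration, the packing step works roughly like the sketch below
(a simplified stand-in, not the bot's actual code): the three-int outer
keys and the inner word->frequency dicts are each flattened into one
struct-packed string.

# Simplified sketch of the packing step -- not the actual bot code.
import struct

def pack_key(key):
    """Pack a (w1, w2, w3) tuple of word IDs into a 12-byte string."""
    return struct.pack('3I', *key)

def pack_inner(freqs):
    """Flatten a {word_id: count} dict into a string of 32-bit ints."""
    flat = []
    for word_id, count in freqs.iteritems():
        flat.extend((word_id, count))
    return struct.pack('%dI' % len(flat), *flat)

def unpack_inner(blob):
    """Rebuild the {word_id: count} dict from a packed string."""
    values = struct.unpack('%dI' % (len(blob) // 4), blob)
    return dict(zip(values[0::2], values[1::2]))

packed = pack_inner({7: 3, 42: 1})
assert unpack_inner(packed) == {7: 3, 42: 1}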

So, do I really have to do this in two separate processes? Would it
help if I implemented the data storage part as a C extension module?
Or am I worrying too much about a problem that is better left to the
OS's memory management (i.e. swapping)?

Of course I'd also appreciate ideas for more efficient data layouts
;-)

Thomas

---8<---
#!/usr/bin/python

import random
import os

def fill(map):
    # build a dict-of-dicts shaped like the bot's table:
    # (int, int, int) keys mapping to {int: frequency} inner dicts
    random.seed(0)
    for i in xrange(300000):
        k1 = (random.randrange(100),
              random.randrange(100),
              random.randrange(100))
        k2 = random.randrange(100)
        if k1 in map:
            d = map[k1]
        else:
            d = dict()
        d[k2] = d.get(k2, 0) + 1
        map[k1] = d

if __name__ == "__main__":
    os.system('ps v')   # memory use before...
    d = dict()
    fill(d)
    os.system('ps v')   # ...after filling...
    del d
    os.system('ps v')   # ...and after deleting the dict
--->8---
 
Paul Rubin

Thomas Rast said:
So, do I really have to do this in two separate processes? Would it
help if I implemented the data storage part as a C extension module?
Or am I worrying too much about a problem that is better left to the
OS's memory management (i.e. swapping)?

Of course I'd also appreciate ideas for more efficient data layouts

When faced with such problems I always find it worthwhile to remember
that the phone company managed to issue itemized long-distance bills
to each of its customers, even in the 1960's using mainframe computers
with a few hundred kb of memory and a room full of magtape drives like
you see in old movies. I ask myself "how would a 1960's mainframe
programmer do this, without megabytes of ram?".

I'm not sure what your dictionary of dictionaries is supposed to do,
but from your dissociated press description it sounds like you want to
look at the pairs of adjacent words and build up a
probability table. In that case I'd do something like:

1. Sequentially read the log files, and for each pair of adjacent words,
write that pair to a temp file tmp1.
2. Sort tmp1 with os.system("sort").
3. Make a sequential pass through tmp1 to build a second temp file tmp2,
which is a compressed tmp1. For example, if tmp1 has the lines
apple pie
apple computer
apple pie
apple sauce

then make a single line in tmp2:

apple: pie:2,computer:1,sauce:1

Do that in the obvious way by building a temporary dict for the words
that follow "apple". Also, while doing this, build a Python dict
giving the offset in tmp2 of each word's line, i.e.

d['apple'] = ofile.tell()

and similarly for all the other words (you have a Python dict with
325k entries, each a single int; that shouldn't be too large).
There are of course lots of tradeoffs you can make to give a smaller
dict, if you need to. You could also use the array module to implement
an int-valued hash table without the overhead of a Python object in each slot.

4. Use the mmap module to map the file tmp2 into memory. Then to get the
successor words to 'apple', simply use the in-memory dict to get the
appropriate offset in tmp2, read and parse the line from that offset,
and emit the appropriate word (see the sketch after this list).

5. There's lots of other optimization and compression you can do to
make tmp2 smaller, which can reduce your paging load, but the above
should be enough to get you started.
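
Here's a rough sketch of steps 3 and 4 (the file names, line format
and helper names are placeholders, and I'm assuming each tmp1 line is
a whitespace-separated "word successor" pair): build_tmp2() compresses
the sorted pairs into one line per word and records each line's byte
offset, and successors() looks a word up through the mmap'd file.

# Rough sketch of steps 3 and 4; names and formats are placeholders.
import mmap

def build_tmp2(tmp1_path, tmp2_path):
    """Compress sorted 'word successor' pairs into one line per word;
    return a dict mapping each word to the offset of its line in tmp2."""
    offsets = {}
    ofile = open(tmp2_path, 'wb')
    current, counts = None, {}

    def flush():
        if current is None:
            return
        offsets[current] = ofile.tell()   # offset of the line written next
        body = ','.join('%s:%d' % kv for kv in counts.iteritems())
        ofile.write('%s: %s\n' % (current, body))

    for line in open(tmp1_path):
        word, succ = line.split()
        if word != current:
            flush()
            current, counts = word, {}
        counts[succ] = counts.get(succ, 0) + 1
    flush()
    ofile.close()
    return offsets

def successors(tmp2_path, offsets, word):
    """Return the {successor: count} dict for one word via mmap."""
    f = open(tmp2_path, 'rb')
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    buf.seek(offsets[word])
    body = buf.readline().rstrip('\n').split(': ', 1)[1]
    return dict((s, int(c)) for s, c in
                (item.split(':') for item in body.split(',')))

That way the hot lines live in the page cache rather than on the
Python heap.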
 
Thomas Rast

Paul Rubin said:
I'm not sure what your dictionary of dictionaries is supposed to do,
but from your dissociated press description it sounds like you want to
look at the pairs of adjacent words and build up a
probability table. In that case I'd do something like:
[... use the 'sort' utility on a temporary file ...]

Thanks a lot for your input! I'm still not sure what exactly I'm
going to do, since I'd also like to update the tables as new lines
come in, but I may just have to regenerate them periodically. In any
case, your post certainly gives me some ideas.

- Thomas
 
Nick Craig-Wood

Thomas Rast said:
My scenario is somewhat involved, so if you're in a hurry, here's an
abstract: I have a daemon that needs about 80MB of RAM to build its
internal data structures, but can pack them to 20MB for the rest of
its lifetime. The other 60MB are never returned to the operating
system.

First a recap on how memory allocation works! This is general - I
don't know exactly how the python memory allocation works, but there
aren't too many ways of doing it under linux.

The way malloc() and friends work under linux is to claim large blocks
of memory from the OS. This is either done with brk()/sbrk() or with
mmap("/dev/zero") (depending on exactly which version of glibc you
are using etc).

For brk()/sbrk() malloc keeps a single area which grows and grows.
This area *can* be shrunk, but only if there is nothing that was
allocated in the way.

mmap("/dev/zero") behaves in pretty much the same way (ie it shrinks
and grows), except for large allocation (>4k I think) the allocation
is done in its own mmap() area. This means that when it is freed it
completely disappears.

So for either of these methods freeing memory back to the OS relies on
the fact that there has been nothing allocated that will stop the
memory shrinking back down all the way. In a dynamic object language
like python that's pretty difficult to guarantee - perhaps impossible!

The one ray of hope is the mmap("/dev/zero") for blocks bigger than
4k. I'm pretty sure this is how modern glibcs work by default.

What you really want to do is get the allocation of the initial dict
and all of its data to be in its own mmap() block. No idea whether
this is possible in python though! It's the kind of thing you could do
in C++ if you could be bothered to wade through all the tedious syntax
;-)
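
You can watch this with a quick experiment -- Linux-specific (it reads
VmRSS from /proc), and the exact threshold and numbers depend on your
glibc and Python build -- a single big block tends to come back to the
OS when freed, while a heap full of small objects generally doesn't:

# Quick Linux-only experiment; exact numbers depend on glibc/Python.
import os

def rss_kb():
    """Resident set size of this process in kB, read from /proc."""
    for line in open('/proc/%d/status' % os.getpid()):
        if line.startswith('VmRSS'):
            return int(line.split()[1])

print 'at start:    %8d kB' % rss_kb()
big = [0] * 10000000          # one large allocation (the list's item array)
print 'after alloc: %8d kB' % rss_kb()
del big
print 'after del:   %8d kB' % rss_kb()   # usually drops back down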

When I've had this sort of problem in the past I've usually resorted
to an on-disk database (dbm of some kind). These are usually fast
enough, and they have the advantage that you'd only have to parse the
log files once. They are reasonably fast to access too, as all the
hot pages soon get into the OS page cache.

Also, as suggested by another poster, /usr/bin/sort is very, very good
at handling huge datasets with minimal memory.
I've thought of building and packing the structures in a fork()ed
process, then piping them over to the main part;

That's a neat idea. You could also build a dbm in the forked process.
is there an easier way to get my RAM back?

I don't think so, but hopefully an expert who knows something about
how python allocates its objects will speak up now!

....

Just for fun I converted your program to use anydbm. It then took 135
seconds to run (vs 12 for your original code), but only used 4 MB of
memory, tops. The values of the dbm are pickled hashes. The keys could
be too, but I decided to just use a tab-separated string...

import random
import os
import anydbm
import cPickle

def fill(map):
    # same as the original fill(), but keys are tab-separated strings and
    # the inner dicts are pickled, since dbm values must be strings
    random.seed(0)
    for i in xrange(300000):
        k1 = "%s\t%s\t%s" % (random.randrange(100),
                             random.randrange(100),
                             random.randrange(100))
        k2 = random.randrange(100)
        if k1 in map:
            d = cPickle.loads(map[k1])
        else:
            d = dict()
        d[k2] = d.get(k2, 0) + 1
        map[k1] = cPickle.dumps(d, -1)

if __name__ == "__main__":
    show_me = 'ps v %d' % os.getpid()
    os.system(show_me)
    d = anydbm.open("ztest", 'n')   # on-disk dbm instead of an in-memory dict
    fill(d)
    os.system(show_me)
    del d
    os.system(show_me)
 
Paul Rubin

Nick Craig-Wood said:
Just for fun I converted your program to use anydbm. It then took 135
seconds to run (vs 12 for your original code), but only used 4 MB of
memory, tops. The values of the dbm are pickled hashes. The keys could
be too, but I decided to just use a tab-separated string...

But I think it used much more than 4 MB of memory if you count the
amount of system cache that held that dbm file. Without the caching
there'd have had to be real disk seeks for all those dbm updates. So
I think if you had a lot more data, enough that you couldn't keep the
dbm in cache any more, you'd get a horrible slowdown. That's why you
may be better off with a sorting-based method, which only needs
sequential disk operations and not a lot of random seeks.
 
Nick Craig-Wood

Paul Rubin said:
But I think it used much more than 4 MB of memory if you count the
amount of system cache that held that dbm file.

The dbm file is only 10 MB though, so that's 10MB of extra memory used
which *is* reclaimable!
Without the caching there'd have had to be real disk seeks for all
those dbm updates. So I think if you had a lot more data, enough
that you couldn't keep the dbm in cache any more, you'd get a
horrible slowdown.

You've only got to keep the index of the dbm in cache, and it's designed
for that purpose.

You do get lots of disk I/O when making the dbm though - I suspect it
does a lot of fsync() - you may be able to turn that off.
That's why you may be better off with a sorting-based method, which
only needs sequential disk operations and not a lot of random
seeks.

A sorting-based method will be better for huge datasets that are only
compiled once. I would have thought a dbm would win for lots of
updates and medium-sized datasets.
 
Thomas Rast

Nick Craig-Wood said:
mmap("/dev/zero") behaves in pretty much the same way (ie it shrinks
and grows), except for large allocation (>4k I think) the allocation
is done in its own mmap() area. This means that when it is freed it
completely disappears.

Ok. That explains why e.g.
l = [0] * 10000000 # 10 million
del l

frees the memory used for l, as expected.
That's a neat idea. You could also build a dbm in the forked
process.

I've implemented it that way now. Using a pipe to transfer the data
to the dbm appears to essentially double the I/O overhead and is very
inefficient.
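
In outline it now looks something like this (heavily simplified;
fill() is meant to be the dbm-aware version from your snippet, and the
file name is a placeholder): the child pays the 80MB peak and writes
the dbm straight to disk, and the parent only ever opens the finished
file.

# Simplified sketch; fill() is assumed to be the dbm-aware version
# from the earlier snippet, and 'tables.db' is a placeholder name.
import os
import anydbm

DB_PATH = 'tables.db'

pid = os.fork()
if pid == 0:
    # child: build the tables (the memory-hungry part), then exit,
    # taking its bloated heap with it
    db = anydbm.open(DB_PATH, 'n')
    fill(db)
    db.close()
    os._exit(0)

os.waitpid(pid, 0)                   # parent: wait for the child...
tables = anydbm.open(DB_PATH, 'r')   # ...then open the compact result
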
I don't think so, but hopefully an expert who knows something about
how python allocates its objects will speak up now!

Well, can't blame Python if you'd need black magic even in C ;-)

Thanks for the explanations!

Thomas
 
Thomas Rast

Nick Craig-Wood said:
The dbm file is only 10 MB though, so that's 10MB of extra memory used
which *is* reclaimable!

Just to clear this up, the problem isn't that the process takes too
much memory while initializing. The machine that runs it has 256MB
RAM, and for all I care it can fill all available memory (when the
logfiles have grown to about four times what they are today, it
will...). The problem was that it wouldn't free *unused* RAM after
initialization.

Anyway, it works now. Thanks for your help Nick and Paul!

- Thomas
 
Andrew Dalke

Thomas said:
I've implemented it that way now. Using a pipe to transfer the data
to the dbm appears to essentially double the I/O overhead and is very
inefficient. ...
Well, can't blame Python if you'd need black magic even in C ;-)

You could also try using shared memory. There's the raw
layer, like http://0pointer.de/lennart/projects/libshbuf/
or a higher layer like POSH ( http://poshmodule.sourceforge.net/ ).

I've used neither. The latter only works on Intel (and
perhaps only Linux) because it does locking with some inline
asm.
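
If a one-shot transfer from a fork()ed child is enough, the stdlib
mmap module can also serve as a poor man's shared buffer -- no extra
packages, but also no locking. A rough sketch, assuming a Unix Python
where mmap accepts -1 for an anonymous shared mapping:

# Sketch: share a buffer between parent and fork()ed child through an
# anonymous shared mmap (Unix only, single writer, no locking).
import mmap
import os
import struct

SIZE = 1024 * 1024            # 1 MB, chosen arbitrarily
buf = mmap.mmap(-1, SIZE)     # anonymous mapping, inherited across fork()

pid = os.fork()
if pid == 0:
    # child: write a length header followed by some packed ints
    payload = struct.pack('3I', 7, 42, 99)
    buf.seek(0)
    buf.write(struct.pack('I', len(payload)) + payload)
    os._exit(0)

os.waitpid(pid, 0)            # parent: wait, then read header and payload
buf.seek(0)
(length,) = struct.unpack('I', buf.read(4))
print struct.unpack('%dI' % (length // 4), buf.read(length))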

Andrew
 
James Kew

Thomas Rast said:
Just to clear this up, the problem isn't that the process takes too
much memory while initializing. [...]
The problem was that it wouldn't free *unused* RAM after
initialization.

Is this really a problem though, if the process is long-running after
initialisation? If the pages are unused after initialisation, won't they
eventually get swapped out to disk making the physical RAM available
elsewhere?

Or is this too simple a view?

James
 
Luis P Caamano

James Kew said:
Thomas Rast said:
Just to clear this up, the problem isn't that the process takes too
much memory while initializing. [...]
The problem was that it wouldn't free *unused* RAM after
initialization.

Is this really a problem though, if the process is long-running after
initialisation? If the pages are unused after initialisation, won't they
eventually get swapped out to disk making the physical RAM available
elsewhere?


Why swap out 60MB of "unused" memory??? That's a waste of CPU cycles
and disk space.

What if you wanted to run his program on a machine with 128MB of RAM
and no swap, or extremely slow swap? If it freed the memory, it would
run nicely with other apps; if it doesn't return the memory to the OS,
it might crash if the system needs to swap, or the app might fail if
everything slows down too much while swapping those 60MB of unused
memory.
 
