Huge shared read-only data with parallel accesses -- how? Multithreading? Multiprocessing?

Valery

Hi all,

Q: how to organize parallel accesses to a huge common read-only Python
data structure?

Details:

I have a huge data structure that takes >50% of RAM.
My goal is to have many computational threads (or processes) that have
efficient read access to this huge and complex data structure.

"Efficient" in particular means "without serialization" and "without
unneeded locking on read-only data"

As far as I can see, there are the following strategies:

1. multi-processing
=> a. child processes get their own *copies* of the huge data structure
-- bad, and not possible at all in my case;
=> b. child processes often communicate with the parent process via
some IPC -- bad (serialization);
=> c. child processes access the huge structure via some shared-memory
approach -- feasible without serialization?! (copy-on-write does not
work well here in CPython/Linux!!) -- see the shared-memory sketch
after this list;

2. multi-threading
=> d. CPython is said to have problems here because of the GIL -- any
comments?
=> e. GIL-less implementations have their own issues -- any hot
recommendations?
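
For illustration, a minimal shared-memory sketch of strategy 1c, which
only works for flat numeric data (the names and sizes here are made up,
and it does not cover nested dicts/lists of Python objects):

import time
from multiprocessing import Pool, Array

def f(_):
    time.sleep(5)              # emulate the real computation
    return sum(shared_x)       # reads straight from the shared block

if __name__ == '__main__':
    n = 10000000
    # one shared block of raw C doubles; no lock needed for read-only use
    shared_x = Array('d', n, lock=False)
    for i in xrange(1, n):
        shared_x[i] = 1. / i
    p = Pool(7)                # forked children inherit the shared block
    print list(p.map(f, xrange(10)))

Because the values are raw doubles rather than Python float objects, the
workers never write into the shared pages while reading them.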

I am a big fan of parallel map() approach -- either
multiprocessing.Pool.map or even better pprocess.pmap. However this
doesn't work straight-forward anymore, when "huge data" means >50%
RAM
;-)

Comments and ideas are highly welcome!!

Here is a workbench example of my case:

######################
import time
from multiprocessing import Pool

def f(_):
    time.sleep(5)         # just to emulate the time used by my computation
    res = sum(parent_x)   # my sophisticated formula goes here
    return res

if __name__ == '__main__':
    parent_x = [1./i for i in xrange(1, 10000000)]  # my huge read-only data :o)
    p = Pool(7)
    res = list(p.map(f, xrange(10)))
    # switch to ps and see how fast your free memory is getting wasted...
    print res
######################

Kind regards
Valery
 
Klauss

> Hi all,
>
> Q: how to organize parallel accesses to a huge common read-only Python
> data structure?
>
> Details:
>
> I have a huge data structure that takes >50% of RAM.
> My goal is to have many computational threads (or processes) that can
> have an efficient read-access to the huge and complex data structure.
>
> <snip>
>
> 1. multi-processing
> => a. child-processes get their own *copies* of huge data structure
> -- bad and not possible at all in my case;

How's the layout of your data, in terms of # of objects vs. bytes used?
Just to have an idea of the overhead involved in refcount
externalization (you know, what I mentioned here:
http://groups.google.com/group/unladen-swallow/browse_thread/thread/9d2af1ac3628dc24
)
 
Emile van Sebille

On 12/9/2009 6:58 AM Valery said...
> Hi all,
>
> Q: how to organize parallel accesses to a huge common read-only Python
> data structure?

I have such a structure, which I buried in a Zope process that keeps it
in memory and is accessed through HTTP requests. This was done about 8
years ago; today I think I'd check out Pyro.
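
For illustration, a minimal stdlib sketch of the same idea -- one
long-running process holds the structure in memory and answers lookups
over HTTP (the key scheme and JSON responses are made up; Pyro would
replace this hand-rolled HTTP layer with remote method calls):

import json
from BaseHTTPServer import HTTPServer, BaseHTTPRequestHandler

# stand-in for the real huge structure, built once at startup
HUGE_DATA = dict((str(i), i * 10) for i in xrange(1000000))

class LookupHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        key = self.path.lstrip('/')          # e.g. GET /42
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(json.dumps(HUGE_DATA.get(key)))

if __name__ == '__main__':
    HTTPServer(('127.0.0.1', 8000), LookupHandler).serve_forever()

Only the small lookup results cross the process boundary, so the
structure itself is never copied or pickled.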

Emile
 
Aaron Watters

> Hi all,
>
> Q: how to organize parallel accesses to a huge common read-only Python
> data structure?

Use a BTree on disk in a file. A good file system will keep most of the
pages you need in RAM whenever the data is "warm". This works for Python
or any other programming language. Generally you can always get to any
piece of data in about 4 seeks at most anyway, so if your disk is fast
your app will be fast too. The file can be accessed concurrently without
problems by any number of processes or threads.
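
For illustration, a minimal sketch of the disk-backed BTree idea,
assuming the legacy bsddb module from the Python 2 stdlib (the file
name, keys and values are made up; keys and values must be strings):

import bsddb

# one-off build step, run once by a single process
db = bsddb.btopen('huge_data.db', 'c')
for i in xrange(1000000):
    db[str(i)] = str(i * 10)
db.close()

# any number of reader processes can then open the same file read-only
reader = bsddb.btopen('huge_data.db', 'r')
print reader['42']
reader.close()

The OS page cache keeps the hot BTree pages in RAM across all readers.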

-- Aaron Watters
http://listtree.appspot.com
http://whiffdoc.appspot.com
===
less is more
 
Antoine Pitrou

On Wed, 09 Dec 2009 06:58:11 -0800, Valery wrote:
> I have a huge data structure that takes >50% of RAM. My goal is to have
> many computational threads (or processes) that can have an efficient
> read-access to the huge and complex data structure.
>
> "Efficient" in particular means "without serialization" and "without
> unneeded lockings on read-only data"

I was going to suggest memcached but it probably serializes non-atomic
types. It doesn't mean it will be slow, though. Serialization implemented
in C may well be faster than any "smart" non-serializing scheme
implemented in Python.
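
For illustration, a minimal sketch of the memcached route, assuming a
memcached server on localhost:11211 and the python-memcached client
(which serializes values behind the scenes on set() and get()):

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

# one-off load step
for i in xrange(1000000):
    mc.set('item:%d' % i, i * 10)

# any process (or thread) can then read
print mc.get('item:42')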
> 2. multi-threading
> => d. CPython is told to have problems here because of GIL -- any
> comments?

What do you call "problems because of the GIL"? It is quite a vague
statement, and an answer would depend on your OS, the number of threads
you're willing to run, and whether you want to extract throughput from
multiple threads or are just concerned about latency.

In any case, you have to do some homework and compare the various
approaches on your own data, and decide whether the numbers are
satisfying to you.
> I am a big fan of parallel map() approach

I don't see what map() has to do with accessing data. map() is for
*processing* of data. In other words, whether or not you use a map()-like
primitive does not say anything about how the underlying data should be
accessed.
 
Klauss

> I was going to suggest memcached but it probably serializes non-atomic
> types.

Atomic as well.
memcached communicates through sockets[3] (albeit possibly unix
sockets, which are faster than TCP ones).

multiprocessing has shared memory schemes, but it does a lot of internal
copying (uses ctypes)... and they are particularly unhelpful when your
shared data is highly structured, since you can't share objects, only
primitive types.


I finished a patch that pushes reference counters into packed pools.
It has lots of drawbacks, but it manages to solve this particular
problem, if the data is predominantly non-numeric (i.e.: lists and
dicts, as mentioned before). Of the drawbacks, perhaps the biggest is a
larger memory footprint - yep... I don't believe there's anything that
can be done to change that. It can be optimized to make the overhead a
little smaller, though.

This test code[1] consumes roughly 2G of RAM on an x86_64 with Python
2.6.1. With the patch, it *should* use 2.3G of RAM (as reported by its
output), so you can see the footprint overhead... but better page
sharing makes it consume about 6 times less -- roughly 400M, which is
the size of the dataset. I.e.: near-optimal data sharing.

This patch[2] has other optimizations intermingled -- if there's
interest in the patch without those (which are both unproven and
nonportable) I could try to separate them. I will have to anyway, to
upload it for inclusion into CPython (if I manage to fix the
shortcomings, and if it gets approved).

The most important shortcomings of the refcount patch are:
1) Tripled memory overhead of reference counting. Before, it was a
single Py_ssize_t per object. Now, it's two pointers plus the
Py_ssize_t. This could perhaps be optimized (by getting rid of the
arena pointer, for instance).
2) Increased code output for Py_INCREF/DECREF. It's small, but it
adds up to a lot. Timings on test_decimal.py (a small numeric
benchmark I use, which might not be representative at all) show a 10%
performance loss in CPU time. Again, this might be optimized with a
lot of work and creativity.
3) Breaks binary compatibility, and in weird cases source
compatibility with extension modules. The PyObject layout is different,
so statically-initialized variables need to stick to using CPython's
macros (I've seen cases where they don't), and code should use
Py_REFCNT() for accessing the refcount, but many just do
ob->ob_refcnt, which will break with the patch.
4) I'm also not really sure (haven't tested) what happens when
CPython runs out of memory - I tried really hard not to segfault, and
even to recover nicely, but you know how hard that is...

[3] http://code.google.com/p/memcached/...mpare_to_a_server_local_cache?_(PHP's_APC,_mm
[2] http://www.deeplayer.com/claudio/misc/Python-2.6.1-refcount.patch
[1] test code below

import time
from multiprocessing import Pool

def usoMemoria():
    import os
    import subprocess
    pid = os.getpid()
    cmd = "ps -o vsz=,rss=,share= -p %s --ppid %s" % (pid, pid)
    p = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE)
    info = p.stdout.readlines()
    s = sum( int(r) for v,r,s in map(str.split, map(str.strip, info)) )
    return s

def f(_):
    # my sophisticated formula goes here
    return sum(int(x) for d in huge_global_data for x in d if x != "state")

if __name__ == '__main__':
    huge_global_data = []
    for i in xrange(500000):
        d = {}
        d[str(i)] = str(i*10)
        d[str(i+1)] = str(i)
        d["state"] = 3
        huge_global_data.append(d)

    p = Pool(7)
    res = list(p.map(f, xrange(20)))

    print "%.2fM" % (usoMemoria() / 1024.0)
 
Valery

Hi Antoine

> I was going to suggest memcached but it probably serializes non-atomic
> types. It doesn't mean it will be slow, though. Serialization implemented
> in C may well be faster than any "smart" non-serializing scheme
> implemented in Python.

No serialization could be faster than NO serialization at all :)

If a child process could directly read the parent's RAM -- what could
be better?
> What do you call "problems because of the GIL"? It is quite a vague
> statement, and an answer would depend on your OS, the number of threads
> you're willing to run, and whether you want to extract throughput from
> multiple threads or are just concerned about latency.

It seems to be a known fact that only one CPython interpreter thread
will be running at a time, because a thread acquires the GIL during
execution and the other threads within the same process are then just
waiting for the GIL to be released.
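
For what it's worth, a minimal sketch of the kind of comparison Antoine
suggests below -- the same CPU-bound loop run in 4 threads vs. 4
processes (the workload and counts are made up; absolute numbers will
vary by machine):

import time
from threading import Thread
from multiprocessing import Process

def work():
    # pure-Python CPU-bound loop; under the GIL only one thread executes it at a time
    s = 0
    for i in xrange(5000000):
        s += i

def timed(label, cls):
    start = time.time()
    jobs = [cls(target=work) for _ in range(4)]
    for j in jobs: j.start()
    for j in jobs: j.join()
    print label, "%.2fs" % (time.time() - start)

if __name__ == '__main__':
    timed("4 threads:  ", Thread)
    timed("4 processes:", Process)

On a multi-core machine the process version should finish several times
faster, while the threaded version effectively runs the loops one after
another.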

> In any case, you have to do some homework and compare the various
> approaches on your own data, and decide whether the numbers are
> satisfying to you.

Well, I think the least evil is to pack/unpack things into array.array
and/or, similarly, NumPy arrays.

I do hope that Klauss' patch will be accepted, because it would let me
forget about a lot of that unneeded packing and unpacking.
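
For illustration, a minimal sketch of that packing workaround (the
names are made up): the values live in one array.array buffer of raw
doubles, so forked workers that only read it never write into those
pages and copy-on-write keeps them shared -- unlike a plain list of
floats, whose per-object refcounts get written to on every read.

import time
from array import array
from multiprocessing import Pool

def f(_):
    time.sleep(5)            # emulate the real computation
    return sum(parent_x)     # reads the buffer, never writes into it

if __name__ == '__main__':
    # pack the huge read-only data into one contiguous buffer of C doubles
    parent_x = array('d', (1. / i for i in xrange(1, 10000000)))
    p = Pool(7)              # forked children inherit parent_x via copy-on-write
    print list(p.map(f, xrange(10)))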

> I don't see what map() has to do with accessing data. map() is for
> *processing* of data. In other words, whether or not you use a map()-like
> primitive does not say anything about how the underlying data should be
> accessed.

Right. However, saying "a big fan" had another focus here: if you write
your code based on maps, then it takes only a tiny effort to convert it
into a MULTIprocessing one :)

Just that.

Kind regards.
Valery
 
garyrob

One thing I'm not clear on regarding Klauss' patch. He says it's
applicable where the data is primarily non-numeric. In trying to
understand why that would be the case, I'm thinking that the increased
per-object memory overhead for reference-counting would outweigh the
space gains from the shared memory.

Klauss's test code stores a large number of dictionaries which each
contain just 3 items. The stored items are strings, but short ones...
it looks like they take up less space than double floats(?).

So my understanding is that the point is that the overhead for the
dictionaries is big enough that the patch is very helpful even though
the stored items are small. And that the patch would be less and less
effective as the number of items stored in each dictionary became
greater and greater, until eventually the patch might use more space
for reference counting than it saved through shared memory.

Is this understanding correct? (I'm hoping not, because for some
applications, I'd like to be able to use it for large dictionaries
containing lots of numbers.)

Thanks,
Gary
 
Klauss

> One thing I'm not clear on regarding Klauss' patch. He says it's
> applicable where the data is primarily non-numeric. In trying to
> understand why that would be the case, I'm thinking that the increased
> per-object memory overhead for reference-counting would outweigh the
> space gains from the shared memory.
>
> Klauss's test code stores a large number of dictionaries which each
> contain just 3 items. The stored items are strings, but short ones...
> it looks like they take up less space than double floats(?).
>
> So my understanding is that the point is that the overhead for the
> dictionaries is big enough that the patch is very helpful even though
> the stored items are small. And that the patch would be less and less
> effective as the number of items stored in each dictionary became
> greater and greater, until eventually the patch might use more space
> for reference counting than it saved through shared memory.

Not really.
The real difference is that numbers (ints and floats) are allocated
out of small contiguous pools. So even if a great percentage of those
objects would remain read-only, there are probably holes in those pools
left by the irregular access pattern during initialization, and those
holes would be written to eventually as the pool gets used.

In essence, those pools aren't read-only, for reasons other than
reference counting.

Dictionaries, tuples and lists (and many other types) don't exhibit
that behavior.
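
As a tiny illustration of that point (a CPython 2 implementation
detail, shown only to make it concrete): freed floats go back to a free
list inside shared allocation blocks, so newly created floats reuse the
same memory, and the blocks keep getting written to even if "your"
numbers never change.

# build some floats, drop them, build new ones, and count reused addresses
a = [float(i) for i in range(100000)]
addrs_before = set(map(id, a))
del a                                  # the float objects return to the free list
b = [float(i) for i in range(100000, 200000)]
addrs_after = set(map(id, b))
print len(addrs_before & addrs_after), "float slots reused"

Under CPython 2 the overlap is typically large, which is exactly why
pages holding numeric objects stop being sharable between parent and
children.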
 
