Is there any way to minimize str()/unicode() objects' memory usage? [Python 2.6.4]


dmtr

I'm running into some performance / memory bottlenecks on large lists.
Is there any easy way to minimize/optimize memory usage?

Simple str() and unicode() objects [Python 2.6.4/Linux/x86]:
Lists of str() and unicode() objects (see reference code below):
[str(i) for i in xrange(0, 10000000)]     -- 370 MB (37 bytes/item)
[unicode(i) for i in xrange(0, 10000000)] -- 613 MB (63 bytes/item)

Well... 63 bytes per item for very short unicode strings... Is there
any way to do better than that? Perhaps some compact unicode objects?

-- Regards, Dmitry

----
import os, time, re

start = time.time()
l = [unicode(i) for i in xrange(0, 10000000)]
dt = time.time() - start
vm = re.findall("(VmPeak.*|VmSize.*)",
                open('/proc/%d/status' % os.getpid()).read())
print "%d keys, %s, %f seconds, %f keys per second" % (
    len(l), vm, dt, len(l) / dt)
 

Steven D'Aprano

I'm running into some performance / memory bottlenecks on large lists.
Is there any easy way to minimize/optimize memory usage?

Yes, lots of ways. For example, do you *need* large lists? Often a better
design is to use generators and iterators to lazily generate data when
you need it, rather than creating a large list all at once.
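
A minimal sketch of the difference (summing string lengths stands in for
whatever work you actually do with the items):

# List comprehension: all 10,000,000 strings are built up front and stay
# alive until the loop is done.
total = sum(len(s) for s in [unicode(i) for i in xrange(10000000)])

# Generator expression: only one string exists at a time, so memory use
# stays flat and the result is the same.
total = sum(len(s) for s in (unicode(i) for i in xrange(10000000)))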

An optimization that sometimes may help is to intern strings, so that
there's only a single copy of common strings rather than multiple copies
of the same one.
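
In Python 2 that's the built-in intern(), which only accepts byte strings,
not unicode. A tiny sketch:

# intern() returns a canonical copy, so repeated values share one object
# instead of N separate ones.
words = [intern(w) for w in ["spam", "eggs", "spam", "spam"]]
assert words[0] is words[2] is words[3]   # one "spam" object, stored once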

Can you compress the data and use that? Without knowing what you are
trying to do, and why, it's really difficult to advise a better way to do
it (other than vague suggestions like "use generators instead of lists").

Very often, it is cheaper and faster to just put more memory in the
machine than to try optimizing memory use. Memory is cheap; your time and
effort are not.

[...]
Well... 63 bytes per item for very short unicode strings... Is there
any way to do better than that? Perhaps some compact unicode objects?

If you think that unicode objects are going to be *smaller* than byte
strings, I think you're badly informed about the nature of unicode.

Python is not a low-level language, and it trades off memory compactness
for ease of use. Python strings are high-level rich objects, not merely a
contiguous series of bytes. If all else fails, you might have to use
something like the array module, or even implement your own data type in
C.

But as a general rule, as I mentioned above, the best way to minimize the
memory used by a large list is to not use a large list. I can't emphasise
that enough -- look into generators and iterators, and lazily handle your
data whenever possible.
 

Thomas Jollans

I'm running into some performance / memory bottlenecks on large lists.
Is there any easy way to minimize/optimize memory usage?

Simple str() and unicode() objects [Python 2.6.4/Linux/x86]:
Lists of str() and unicode() objects (see reference code below):
[str(i) for i in xrange(0, 10000000)]     -- 370 MB (37 bytes/item)
[unicode(i) for i in xrange(0, 10000000)] -- 613 MB (63 bytes/item)

Well... 63 bytes per item for very short unicode strings... Is there
any way to do better than that? Perhaps some compact unicode objects?

There is a certain price you pay for having full-featured Python objects.

What are you trying to accomplish anyway? Maybe the array module can be
of some help. Or numpy?


 

dmtr

Steven, thank you for answering. See my comments inline. Perhaps I
should have formulated my question a bit differently: Are there any
*compact* high performance containers for unicode()/str() objects in
Python? By *compact* I don't mean compression. Just optimized for
memory usage, rather than performance.

What I'm really looking for is a dict() that maps short unicode
strings into tuples with integers. But just having a *compact* list
container for unicode strings would help a lot (because I could add a
__dict__ and go from it).

Yes, lots of ways. For example, do you *need* large lists? Often a better
design is to use generators and iterators to lazily generate data when
you need it, rather than creating a large list all at once.

Yes. I do need to be able to process large data sets.
No, there is no way I can use an iterator or lazily generate data when
I need it.

An optimization that sometimes may help is to intern strings, so that
there's only a single copy of common strings rather than multiple copies
of the same one.

Unfortunately the strings are unique (think usernames on Facebook or
Wikipedia). And I can't afford to store them in db/memcached/redis/
etc... Too slow.

Can you compress the data and use that? Without knowing what you are
trying to do, and why, it's really difficult to advise a better way to do
it (other than vague suggestions like "use generators instead of lists").

Yes. I've tried. But I was unable to find a good, unobtrusive way to
do that. Every attempt either adds some pesky unnecessary code, or is
slow, or something like that. See more at: http://bugs.python.org/issue9520

Very often, it is cheaper and faster to just put more memory in the
machine than to try optimizing memory use. Memory is cheap, your time and
effort is not.

Well... I'd really prefer to use, say, 16 bytes for 10-char strings and
fit the data into 8 GB, rather than paying an extra $1k for 32 GB.

If you think that unicode objects are going to be *smaller* than byte
strings, I think you're badly informed about the nature of unicode.

I don't think that unicode objects are going to be *smaller*!
But AFAIK internally CPython uses UTF-8, no? And 63 bytes per item
seems a bit excessive.
My question was: is there any way to do better than that?

Python is not a low-level language, and it trades off memory compactness
for ease of use. Python strings are high-level rich objects, not merely a
contiguous series of bytes. If all else fails, you might have to use
something like the array module, or even implement your own data type in
C.

Are there any *compact* high-performance containers (with a dict or list
interface) in Python?

-- Regards, Dmitry
 

dmtr

Well... 63 bytes per item for very short unicode strings... Is there
any way to do better than that?

There is a certain price you pay for having full-featured Python objects.

Are there any *compact* Python objects? Optimized for compactness?

What are you trying to accomplish anyway? Maybe the array module can be
of some help. Or numpy?

Ultimately a dict that can store ~20,000,000 entries: (u'short
string' : (int, int, int, int, int, int, int)).

-- Regards, Dmitry
 

Chris Rebert

I don't think that unicode objects are going to be *smaller*!
But AFAIK internally CPython uses UTF-8?

Nope. unicode objects internally use UCS-2 or UCS-4, depending on how
CPython was ./configure-d; the former is the default.
See PEP 261.
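
You can check which build you have from sys.maxunicode:

import sys
# 65535 (0xFFFF) means a narrow (UCS-2) build, 2 bytes per code unit;
# 1114111 (0x10FFFF) means a wide (UCS-4) build, 4 bytes per code unit.
print sys.maxunicode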

Cheers,
Chris
 

Neil Hodgson

dmtr:
What I'm really looking for is a dict() that maps short unicode
strings into tuples with integers. But just having a *compact* list
container for unicode strings would help a lot (because I could add a
__dict__ and go from it).

Add them all into one string or array and use indexes into that string.
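
Something along these lines (a rough sketch rather than a finished
container; key_at/index_of are just illustrative helpers, and the values
would live in a parallel array indexed by the returned position):

import array

# Build: sort the (UTF-8 encoded) keys, concatenate them into one big str
# and remember where each key starts.
keys = sorted(unicode(i).encode('utf-8') for i in xrange(1000000))
offsets = array.array('l', [0])
for k in keys:
    offsets.append(offsets[-1] + len(k))
blob = ''.join(keys)

def key_at(idx):
    return blob[offsets[idx]:offsets[idx + 1]]

def index_of(key):
    # Plain binary search over the offsets.
    lo, hi = 0, len(keys)
    while lo < hi:
        mid = (lo + hi) // 2
        if key_at(mid) < key:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(keys) and key_at(lo) == key:
        return lo
    raise KeyError(key)

print key_at(index_of('123456'))   # -> '123456'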

Neil
 

Carl Banks

Are there any *compact* Python objects? Optimized for compactness?

Yes, but probably not in the way that'd be useful to you.

Look at the array module, and also consider the third-party numpy
library. They store compact arrays of numeric types (mostly) but they
have character type storage as well. That probably won't help you,
though, since you have variable-length strings.

I don't know of any third-party types that can do what you want, but
there might be some. Search PyPI.

Ultimately a dict that can store ~20,000,000 entries: (u'short
string' : (int, int, int, int, int, int, int)).

My recommendation would be to use sqlite3. Only if you know for sure
that it's too slow--meaning that you've actually tried it and it was
too slow, and nothing else--should you bother with a custom in-memory
structure.

For that I'd probably go with a binary tree rather than a hash. So
you have a huge numpy character array that stores all 20 million short
strings end-to-end (in lexical order, so that you can look up the
strings with a binary search), then you have a numpy integer array
that stores the indices into this string where the word boundaries
are, and then an Nx7 numpy integer array storing the int return
values. That's three compact arrays.
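
A rough sketch of that layout (numpy assumed; a small synthetic dataset
stands in for the real one, and the key bytes are kept in one plain str,
which plays the same role as a numpy character array):

import numpy as np

# Synthetic stand-in for the real data: (utf-8 key, 7 ints), sorted by key.
items = sorted((unicode(i).encode('utf-8'), (i, i+1, i+2, i+3, i+4, i+5, i+6))
               for i in xrange(1000))

blob = ''.join(k for k, _ in items)                       # keys end-to-end
bounds = np.zeros(len(items) + 1, dtype=np.int64)         # word boundaries
bounds[1:] = np.cumsum([len(k) for k, _ in items])
values = np.array([v for _, v in items], dtype=np.int32)  # N x 7 ints

def lookup(key):
    # Ordinary binary search over the boundary array.
    lo, hi = 0, len(items)
    while lo < hi:
        mid = (lo + hi) // 2
        if blob[bounds[mid]:bounds[mid + 1]] < key:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(items) and blob[bounds[lo]:bounds[lo + 1]] == key:
        return values[lo]          # the seven ints stored for this key
    raise KeyError(key)

print lookup('500')                # -> the row (500, 501, ..., 506)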


Carl Banks
 

Michael Torrie

Ultimately a dict that can store ~20,000,000 entries: (u'short
string' : (int, int, int, int, int, int, int)).

I think you really need a real database engine. With the proper
indexes, MySQL could be very fast storing and retrieving this
information for you. And it will use your RAM to cache as it sees fit.
Don't try to reinvent the wheel here.
 

dmtr

I think you really need a real database engine.  With the proper
indexes, MySQL could be very fast storing and retrieving this
information for you.  And it will use your RAM to cache as it sees fit.
 Don't try to reinvent the wheel here.

No, I've tried. DB solutions are not even close in terms of speed;
processing would take weeks :( Memcached or Redis sort of work, but
they are still a bit too slow to be a pleasure to work with. The
standard dict() container is *a lot* faster. It is also hassle-free
(accepting unicode keys, etc.). I just wish there were a more compact
dict container, optimized for large datasets and memory rather than
speed. And with the default dict() I'm also running into some kind
of nonlinear performance degradation, apparently after
10,000,000-13,000,000 keys. But I can't recreate this with a solid
test case (see http://bugs.python.org/issue9520 ) :(

-- Dmitry
 

garabik-news-2005-05

dmtr said:
What I'm really looking for is a dict() that maps short unicode
strings into tuples with integers. But just having a *compact* list
container for unicode strings would help a lot (because I could add a
__dict__ and go from it).

At this point, I'd suggest using one of the dbm modules, and packing the
integers with struct.pack into a short string.
Depending on your usage pattern, there are marked performance
differences between the dbhash, gdbm, and dbm implementations, so perhaps
it would pay off to invest some time in benchmarking.
If your data are write-once, then cdb has excellent performance (but a
different API).
The file will usually be cached in RAM, so no need to worry about I/O
bottlenecks... and if it is small enough, you can always put it on a
ramdisk.
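
A minimal sketch of that approach (anydbm picks whichever dbm flavour is
available; the filename is arbitrary):

import anydbm, struct

db = anydbm.open('words.db', 'c')                 # 'c': create if missing

key = u'short string'.encode('utf-8')
db[key] = struct.pack('7i', 1, 2, 3, 4, 5, 6, 7)  # 7 ints -> a 28-byte str

print struct.unpack('7i', db[key])                # -> (1, 2, 3, 4, 5, 6, 7)
db.close()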

If your strings are long enough, you can improve memory usage with
zlib.compress (a dumb and suboptimal way of using compression, but
easy and present in the standard library) - but always verify that the
compressed strings are _shorter_ than the originals.
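
i.e. something along these lines (a sketch; one marker byte records
whether the stored form is compressed):

import zlib

def pack(s):
    # Keep whichever form is shorter; 'z' marks compressed, 'r' raw.
    c = zlib.compress(s)
    return 'z' + c if len(c) + 1 < len(s) else 'r' + s

def unpack(blob):
    return zlib.decompress(blob[1:]) if blob[0] == 'z' else blob[1:]

s = 'spam ' * 50
assert unpack(pack(s)) == s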

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
 

Peter Otten

dmtr said:
Are there any *compact* Python objects? Optimized for compactness?


Ultimately a dict that can store ~20,000,000 entries: (u'short
string' : (int, int, int, int, int, int, int)).

I don't know to what extent it still applies but switching off cyclic garbage
collection with

import gc
gc.disable()

while building large data structures used to speed up things significantly.
That's what I would try first with your real data.

Encoding your unicode strings as UTF-8 could save some memory.

When your integers fit into two bytes, say, you can use an array.array()
instead of the tuple.
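
For example (assuming the values fit in a signed 16-bit range):

import array

t = (1, 2, 3, 4, 5, 6, 7)
a = array.array('h', t)          # 'h': signed 16-bit integers

# 7 values occupy 14 bytes of payload in the array, versus 7 full int
# objects (plus their pointers) inside a tuple.
print a.itemsize * len(a)        # -> 14
print tuple(a) == t              # -> True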

Peter
 

dmtr

I don't know to what extent it still applies but switching off cyclic garbage
collection with

import gc
gc.disable()


Haven't tried it on the real dataset. On the synthetic test it (and
sys.setcheckinterval(100000)) gave ~2% speedup and no change in memory
usage. Not significant. I'll try it on the real dataset though.

while building large data structures used to speed up things significantly.
That's what I would try first with your real data.

Encoding your unicode strings as UTF-8 could save some memory.

Yes... In fact that's what I'm trying now... .encode('utf-8')
definitely creates some clutter in the code, but I guess I can
subclass dict... And it does save memory! A lot of it. Seems to be a
bit faster too....

When your integers fit into two bytes, say, you can use an array.array()
instead of the tuple.

Excellent idea. Thanks! And it seems to work too, at least for the
test code. Here are some benchmarks (x86 desktop):

Unicode key / tuple:
for i in xrange(0, 1000000): d[unicode(i)] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t 224704 kB', 'VmSize:\t 224704 kB'],
4.079240 seconds, 245143.698209 keys per second
for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
1000000 keys, ['VmPeak:\t 201440 kB', 'VmSize:\t 201440 kB'],
4.985136 seconds, 200596.331486 keys per second
for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t 125652 kB', 'VmSize:\t 125652 kB'],
3.572301 seconds, 279931.625282 keys per second

Almost halved the memory usage. And faster too. Nice.

-- Dmitry
 

dmtr

Correction. I've copy-pasted it wrong! array.array('i', (i, i+1, i+2,
i+3, i+4, i+5, i+6)) was the best.
for i in xrange(0, 1000000): d[unicode(i)] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t 224704 kB', 'VmSize:\t 224704 kB'],
4.079240 seconds, 245143.698209 keys per second
for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = (i, i+1, i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t 201440 kB', 'VmSize:\t 201440 kB'],
4.985136 seconds, 200596.331486 keys per second
for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
1000000 keys, ['VmPeak:\t 125652 kB', 'VmSize:\t 125652 kB'],
3.572301 seconds, 279931.625282 keys per second

-- Dmitry
 

Peter Otten

dmtr said:
I don't know to what extent it still applies but switching off cyclic
garbage collection with

import gc
gc.disable()


Haven't tried it on the real dataset. On the synthetic test it (and
sys.setcheckinterval(100000)) gave ~2% speedup and no change in memory
usage. Not significant. I'll try it on the real dataset though.

while building large data structures used to speed up things
significantly. That's what I would try first with your real data.

Encoding your unicode strings as UTF-8 could save some memory.

Yes... In fact that's what I'm trying now... .encode('utf-8')
definitely creates some clutter in the code, but I guess I can
subclass dict... And it does save memory! A lot of it. Seems to be a
bit faster too....

When your integers fit into two bytes, say, you can use an array.array()
instead of the tuple.

Excellent idea. Thanks! And it seems to work too, at least for the
test code. Here are some benchmarks (x86 desktop):

Unicode key / tuple:
for i in xrange(0, 1000000): d[unicode(i)] = (i, i+1, i+2, i+3, i+4,
i+5, i+6)
1000000 keys, ['VmPeak:\t 224704 kB', 'VmSize:\t 224704 kB'],
4.079240 seconds, 245143.698209 keys per second
for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] =
array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
1000000 keys, ['VmPeak:\t 201440 kB', 'VmSize:\t 201440 kB'],
4.985136 seconds, 200596.331486 keys per second
for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = (i, i+1,
i+2, i+3, i+4, i+5, i+6)
1000000 keys, ['VmPeak:\t 125652 kB', 'VmSize:\t 125652 kB'],
3.572301 seconds, 279931.625282 keys per second

Almost halved the memory usage. And faster too. Nice.

def benchmark_dict(d, N):
    start = time.time()

    for i in xrange(N):
        length = lengths[random.randint(0, 255)]
        word = ''.join([letters[random.randint(0, 255)] for i in xrange(length)])
        d[word] += 1

    dt = time.time() - start
    vm = re.findall("(VmPeak.*|VmSize.*)",
                    open('/proc/%d/status' % os.getpid()).read())
    print "%d keys (%d unique), %s, %f seconds, %f keys per second" % (
        N, len(d), vm, dt, N / dt)

Looking at your benchmark, random.choice(letters) has probably less overhead
than letters[random.randint(...)]. You might even try to inline it as

letters[int(random.random() * 256)]

Peter
 

dmtr

Looking at your benchmark, random.choice(letters) has probably less overhead
than letters[random.randint(...)]. You might even try to inline it as

Right... random.choice()... I'm a bit new to Python, always something
to learn. But anyway, in that benchmark (from http://bugs.python.org/issue9520 )
the code that generates 'words' takes 90% of the time. And I'm really
looking at deltas between different methods, not the absolute value. I
was also using different code to get benchmarks for my previous
message... Here's the code:


#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, time, re, array

start = time.time()
d = dict()
for i in xrange(0, 1000000):
    d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))
dt = time.time() - start
vm = re.findall("(VmPeak.*|VmSize.*)",
                open('/proc/%d/status' % os.getpid()).read())
print "%d keys, %s, %f seconds, %f keys per second" % (
    len(d), vm, dt, len(d) / dt)
 

dmtr

I guess with the actual dataset I'll be able to improve the memory
usage a bit, with BioPython::trie. That would probably be enough
optimization to continue working with some comfort. On this test code
BioPython::trie gives a bit of improvement in terms of memory. Not
much though...
d = dict()
for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))

1000000 keys, ['VmPeak:\t 125656 kB', 'VmSize:\t 125656 kB'],
3.525858 seconds, 283618.896034 keys per second

from Bio import trie
d = trie.trie()
for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))

1000000 keys, ['VmPeak:\t 108932 kB', 'VmSize:\t 108932 kB'],
4.142797 seconds, 241382.814950 keys per second
 

garabik-news-2005-05

dmtr said:
I guess with the actual dataset I'll be able to improve the memory
usage a bit, with BioPython::trie. That would probably be enough
optimization to continue working with some comfort. On this test code
BioPython::trie gives a bit of improvement in terms of memory. Not
much though...
d = dict()
for i in xrange(0, 1000000): d[unicode(i).encode('utf-8')] = array.array('i', (i, i+1, i+2, i+3, i+4, i+5, i+6))

Using struct.pack('7i', i, i+1, i+2, i+3, i+4, i+5, i+6) instead of
array.array gives a 20% improvement in time with (not surprisingly)
the same memory usage.
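
i.e. roughly:

import struct

i = 42
packed = struct.pack('7i', i, i+1, i+2, i+3, i+4, i+5, i+6)  # plain 28-byte str
print struct.unpack('7i', packed)    # -> (42, 43, 44, 45, 46, 47, 48)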

--
-----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__ garabik @ kassiopeia.juls.savba.sk |
-----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
 

Nobody

Steven, thank you for answering. See my comments inline. Perhaps I
should have formulated my question a bit differently: Are there any
*compact* high performance containers for unicode()/str() objects in
Python? By *compact* I don't mean compression. Just optimized for
memory usage, rather than performance.

What I'm really looking for is a dict() that maps short unicode
strings into tuples with integers.

Use UTF-8 encoded strings (str/bytes) as keys rather than unicode objects.
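
A rough sketch of hiding the encoding behind a dict subclass (the class
name is just illustrative, and only the handful of methods shown are
covered):

class UTF8KeyDict(dict):
    """dict that stores unicode keys as UTF-8 byte strings internally."""

    def __setitem__(self, key, value):
        dict.__setitem__(self, key.encode('utf-8'), value)

    def __getitem__(self, key):
        return dict.__getitem__(self, key.encode('utf-8'))

    def __contains__(self, key):
        return dict.__contains__(self, key.encode('utf-8'))

d = UTF8KeyDict()
d[u'short string'] = (1, 2, 3, 4, 5, 6, 7)
print d[u'short string']            # -> (1, 2, 3, 4, 5, 6, 7)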
 
