Memory Error while constructing Compound Dictionary


Benjamin Scott

Hello.

I attempted to build a compound dictionary:

len(Lst)=1000
len(nuerLst)=250
len(nuestLst)=500

Dict={}

for s in Lst:
    Dict[s]={}

for s in Lst:
    for t in nuerLst:
        Dict[s][t]={}

for s in Lst:
    for t in nuerLst:
        for r in nuestLst:
            Dict[s][t][r]={}



I got the following error:

Traceback (most recent call last):
File "<pyshell#89>", line 5, in -toplevel-
Dict[s][t][r]=[]
MemoryError


Specs:

Python 2.3.4
XPpro
4 GB RAM


Python was utilizing 2.0 GB when the error was generated. I have
attempted this task twice with different data sets. I got the same
error both times.

Thanks in advance for your feedback,

Benjamin Scott
 

Larry Bates

You are asking Python to create 125,000,000
(yes that's 125 million) empty dictionaries. The hash
(key) of the dictionary takes up at least a couple
of bytes, plus the amount of space taken up by an
empty dictionary object at each entry (125 million
of them). I'm not surprised; it adds up pretty quickly.
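
For a rough back-of-the-envelope check (sys.getsizeof appeared in Python
versions well after the 2.3.4 used here, and per-object sizes vary by
version, so treat the numbers as illustrative only):

import sys

# 1000 * 250 * 500 innermost entries, each an empty dict plus the slot
# in its parent dictionary that points at it.
n_slots = 1000 * 250 * 500         # 125,000,000
empty_dict = sys.getsizeof({})     # ~64 bytes on recent CPython;
                                   # around 16 bytes back on Python 2.3
slot = 8                           # rough cost of one parent-dict slot

print("roughly %.1f GB" % (n_slots * (empty_dict + slot) / 1e9))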

HTH,
Larry Bates
Syscon, Inc.
 

Phil Frost

If you are using Linux on x86, there is a kernel configuration option
that might give more than 2GiB of address space to each process. This
will require that you compile your own kernel.

If you aren't using Linux on x86, then you are still probably hitting an
OS limitation. If you have access to an Itanium, G5, or newer MIPS (SGI)
CPU, that *might* help.

If your keys are sequential integers, you will likely find numerical
python <http://numpy.sf.net> useful.
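
For instance (a sketch using modern NumPy rather than the Numeric package
of that era; the rows below are made up):

import numpy as np

rows = [("Factory_1", "2004-09-07", "widget_A"),
        ("Factory_1", "2004-09-07", "widget_A"),
        ("Factory_2", "2004-09-08", "widget_B")]

factories, dates, widgets = {}, {}, {}

def index(table, key):
    # Assign the next free integer the first time a key is seen.
    return table.setdefault(key, len(table))

# 1000 x 250 x 500 uint32 counters is about 500 MB -- dense but feasible.
counts = np.zeros((1000, 250, 500), dtype=np.uint32)

for factory, date, widget in rows:
    counts[index(factories, factory),
           index(dates, date),
           index(widgets, widget)] += 1

print(counts[factories["Factory_1"], dates["2004-09-07"], widgets["widget_A"]])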
 

Alex Martelli

Benjamin Scott said:
len(Lst)=1000
len(nuerLst)=250
len(nuestLst)=500

So you want 1000*250*500 = 125 million dictionaries...?
Specs:

Python 2.3.4
XPpro
4 GB RAM


Python was utilizing 2.0 GB when the error was generated. I have

So you've found out that a dictionary takes at least (about) 16 bytes
even when empty -- not surprising, since 16 bytes is typically the
smallest chunk of memory the system will allocate at a time; 125 million
of those comes to roughly 2 GB, right where you hit the wall. And you've
found out that XP so-called pro doesn't let a user program have more than
2GB to itself -- I believe there are slight workarounds for that, as in
costly hacks that may let you have 3GB or so, but it's not going to help
if you want to put any information in those dictionaries: even a tiny
amount of info per dict will easily bump each dict's size to 32 bytes
and overwhelm your 32-bit processor's addressing capabilities (I'm
assuming you have a 32-bit CPU -- you don't say, but few people use
64-bitters yet).

What problem are you really trying to solve? Unless you can splurge
into a 64-bit CPU with an adequate OS (e.g., AMD 64 with a Linux for it,
or a G5-based Mac) anything requiring SO many gigabytes probably needs a
radical rethink of your intended architecture/strategy, and it's hard to
give suggestions without knowing what problem you need to solve.


Alex
 

Benjamin Scott

Thanks for the replies.

First I will make a minor correction to the code I originally posted
and then I will describe the original problem I am trying to solve,
per Alex's request.

Correction:

for s in Lst:
    for t in nuerLst:
        for r in nuestLst:
            Dict[s][t][r]={}

....should actually be...

for s in Lst:
    for t in nuerLst:
        for r in nuestLst:
            Dict[s][t][r]=[]

That is, the object accessed by 3 keys is a list, not a 4th
dictionary.


The Original Problem:

The data set: 3 Columns and at least 100,000 rows. However, it can
be up to 1,000,000 rows.

For the purpose of illustration let's suppose that the first column
has the name of 1,000 "Factories", i.e. there are 1,000 unique symbols
in the first column. Likewise, suppose the second column contains a
"production date" or just a date; there are 250 unique dates in the
second column. Finally, suppose the third column contains a
description of a "widget type"; there are 500 unique widget
descriptions.

*** i.e. each row contains the name of one factory which produced one
widget type on a particular date. If a factory produced more than one
widget on a given date it is reflected in the data as a new row. ***

The motivation to construct the mentioned compound dictionary comes
from the fact that I need quick access to the following data sets:

Data Set for Factory #1:
Column#1: time 1, time 2, ... , time 250
Column#2: #widgets, #widgets, ... , #widgets <- same widget types

Data Set for Factory #2:
Column#1: time 1, time 2, ... , time 250
Column#2: #widgets, #widgets, ... , #widgets <- same widget types

..
..
..

Data Set for Factory #1000:
Column#1: time 1, time 2, ... , time 250
Column#2: #widgets, #widgets, ... , #widgets <- same widget types

Note that if the compound dictionary was created, it would be easy to
construct these data sets like so:

File=open('DataSet', 'r')
Lst=File.readlines()
..
..
..

len(Lst[n])=3

Lst[n][0]="Factory"
Lst[n][1]="date"
Lst[n][2]="WidgetType"

for s in Lst:
    Dict[s[0]][s[1]][s[2]].append('1')
..
..
..

len(Dict["Factory"]["date"]["WidgetType"]) = #Widgets of some type
produced at a Factory on a given date.
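
For what it's worth, a self-contained sketch of that pipeline, assuming
whitespace-separated columns in 'DataSet' (the format isn't spelled out
above) and using setdefault so it also runs without the big up-front
initialization:

Dict = {}

for line in open('DataSet'):
    factory, date, widget = line.split()
    Dict.setdefault(factory, {}).setdefault(date, {}) \
        .setdefault(widget, []).append('1')

# One "data set" per factory, as described above: the dates in one
# column, the widget counts (len of the '1' lists) in the other.
def factory_dataset(name):
    rows = []
    for date in sorted(Dict.get(name, {})):
        for widget, ones in sorted(Dict[name][date].items()):
            rows.append((date, widget, len(ones)))
    return rows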

The idea here is that I will be graphing a handful of the data sets at
a time; they will then be discarded and a new handful will be
graphed... etc.

What I might attempt next is to construct the required data in R (or
NumPy) since an array object seems better suited for the task.
However, I'm not sure this will avert the memory error. So, does
anyone know how to increase the RAM limit for a process? Other
suggestions are also welcome.



-Benjamin Scott
 

Dennis Lee Bieber

The data set: 3 Columns and at least 100,000 rows. However, it can
be up to 1,000,000 rows.

For the purpose of illustration let's suppose that the first column
has the name of 1,000 "Factories", i.e. there are 1,000 unique symbols
in the first column. Likewise, suppose the second column contains a
"production date" or just a date; there are 250 unique dates in the
second column. Finally, suppose the third column contains a
description of a "widget type"; there are 500 unique widget
descriptions.

*** i.e. each row contains the name of one factory which produced one
widget type on a particular date. If a factory produced more than one
widget on a given date it is reflected in the data as a new row. ***

The motivation to construct the mentioned compound dictionary comes
from the fact that I need quick access to the following data sets:

Data Set for Factory #1:
Column#1: time 1, time 2, ... , time 250
Column#2: #widgets, #widgets, ... , #widgets <- same widget types

For some reason I'd suggest living with the speed of a decent
RDBMS (maybe MySQL) and generating the queries as needed.

Or, maybe, go to some fancier RDBMS -- one that supports "views"
or some other "predefined query" capability, so the result set doesn't
have to be computed each time.
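
As a sketch of that direction (using the sqlite3 module from Python
versions newer than the 2.3.4 in this thread; table, column and file
names are made up):

import sqlite3

con = sqlite3.connect("widgets.db")
con.execute("""CREATE TABLE IF NOT EXISTS production
               (factory TEXT, date TEXT, widget TEXT)""")

# Load the rows once (placeholder data shown here).
con.executemany("INSERT INTO production VALUES (?, ?, ?)",
                [("Factory_1", "2004-09-07", "widget_A")])

# A view acts as the "predefined query": per-factory, per-date counts.
con.execute("""CREATE VIEW IF NOT EXISTS widget_counts AS
               SELECT factory, date, widget, COUNT(*) AS n
               FROM production GROUP BY factory, date, widget""")

for row in con.execute("SELECT date, widget, n FROM widget_counts "
                       "WHERE factory = ?", ("Factory_1",)):
    print(row)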

--
 

Jeremy Bowers

The Original Problem:

The data set: 3 Columns and at least 100,000 rows. However, it can
be up to 1,000,000 rows.

For the purpose of illustration let's suppose that the first column
has the name of 1,000 "Factories", i.e. there are 1,000 unique symbols
in the first column. Likewise, suppose the second column contains a
"production date" or just a date; there are 250 unique dates in the
second column. Finally, suppose the third column contains a
description of a "widget type"; there are 500 unique widget
descriptions.

Based on these numbers, aren't the majority of your lists empty? In that
case, key off of a tuple as you need them. For some "factory", "date",
"widget" and "whatever" (the data, which you don't specify):

Dict.setdefault((factory, date, widget), []).append(whatever)

(You may actually benefit from changing the setdefault around to some
"try"-based scheme; you'd have to benchmark the various choices if it
matters to you. Also, depending on your data, consider Psyco.)

No extra memory. Some potentially significant slowdown due to tuple
construction and empty list construction.
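
A small sketch of both variants (the row values are placeholders, since
the real data isn't specified):

Dict = {}

rows = [("Factory_1", "2004-09-07", "widget_A", "1"),
        ("Factory_1", "2004-09-07", "widget_A", "1")]

for factory, date, widget, whatever in rows:
    Dict.setdefault((factory, date, widget), []).append(whatever)

    # The try-based alternative mentioned above; often faster once most
    # keys already exist, since no default list gets built each time:
    # try:
    #     Dict[factory, date, widget].append(whatever)
    # except KeyError:
    #     Dict[factory, date, widget] = [whatever]

print(len(Dict["Factory_1", "2004-09-07", "widget_A"]))   # -> 2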

(Depending on what you are doing, C++ and the STL can be not *that* much
harder to program in, and it should support hash tables with tuple keys,
although I don't know the magic incantations off the top of my head. If
you're just adding some numbers or something, this is also likely to be
enough faster that it may even be faster to develop in, since the program
will almost certainly execute much, much faster, potentially making the
write-compile-test-debug cycle still faster than Python. Of course, if
this is a prelude to complicated meta-hack programming of the type Python
makes easy, never mind; I'm kinda guessing you're just storing numbers on
the grounds that you can't store much else in the machine at once. If
that's a bad assumption, you are in some trouble, as even C++ can only be
100%-epsilon efficient and it will likely be affected by the same memory
limit, in which case you either need a larger machine or a database.

Actually, you probably should go to a database now; these problems have a
habit of doubling or tripling in size when someone decides it'd be neat to
add Just One More attribute and you already pretty much can't afford that.)
 

Alex Martelli

Benjamin Scott said:
Thanks for the replies.

First I will make a minor correction to the code I originally posted
and then I will describe the original problem I am trying to solve,
per Alex's request.

Correction:

for s in Lst:
    for t in nuerLst:
        for r in nuestLst:
            Dict[s][t][r]={}

...should actually be...

for s in Lst:
    for t in nuerLst:
        for r in nuestLst:
            Dict[s][t][r]=[]

That is, the object accessed by 3 keys is a list, not a 4th
dictionary.


OK, unfortunately that doesn't change memory requirements, as 16 bytes
is still a minimum allocation for an object.
The Original Problem:

The data set: 3 Columns and at least 100,000 rows. However, it can
be up to 1,000,000 rows.

Aha -- a sparse 3D matrix, VERY sparse, no more than 1 million "true"
entries out of 125 million slots, and all the rest just
"placeholders"...
For the purpose of illustration let's suppose that the first column
has the name of 1,000 "Factories", i.e. there are 1,000 unique symbols
in the first column. Likewise, suppose the second column contains a
"production date" or just a date; there are 250 unique dates in the
second column. Finally, suppose the third column contains a
description of a "widget type"; there are 500 unique widget
descriptions.

Sure, quite clear.
*** i.e. each row contains the name of one factory which produced one
widget type on a particular date. If a factory produced more than one
widget on a given date it is reflected in the data as a new row. ***

The motivation to construct the mentioned compound dictionary comes
from the fact that I need quick access to the following data sets: ...
len(Lst[n])=3

Lst[n][0]="Factory"
Lst[n][1]="date"
Lst[n][2]="WidgetType"

for s in Lst:
    Dict[s[0]][s[1]][s[2]].append('1')
.
.
.

len(Dict["Factory"]["date"]["WidgetType"]) = #Widgets of some type
produced at a Factory on a given date.

The idea here is that I will be graphing a handful of the data sets at
a time; they will then be discarded and a new handful will be
graphed... etc.

What I might attempt next is to construct the required data in R (or
NumPy) since an array object seems better suited for the task.
However, I'm not sure this will avert the memory error. So, does

When you represent a sparse matrix as if it was a dense one, you hit a
typical wall of combinatorial explosion: you need memory proportional to
the product of all the dimensions, and for a matrix of high
dimensionality it's a horrid prospect.
anyone know how to increase the RAM limit for a process? Other

With a 32-bit CPU you're SOL, IMHO. One option is to change machines:
Apple has just announced a very reasonably priced iMac G5, a 64-bit
machine intended for the home; or, you can do as I did, and look for a
little-used, well-reconditioned, guaranteed PowerMac G5 -- the latter
can use 8 GB of physical RAM and more importantly the address space is
only bounded by the amount of disk available, so a few hundred GBs may
be handled if you're in no hurry. While these are wonderful machines,
however, I think you can do better. Consider...:
suggestions are also welcome.

The Null Object Design Pattern is more likely to be what you want ( a
fancy name for what in this case is quite simple, read on...):

Start by placing in each slot of the compound dictionary the SAME
object, which is just a placeholder. So you'll still have 125 million
slots, but all initially will point at the same placeholder: so you're
spending only 125 million times the size of a SLOT, about 4 bytes, for a
total of 500 megabytes -- plus something because dictionaries, being hash
tables, are always "overdimensioned" a bit, but you should fit very
comfortably in your 2GB anyway.

Now, as the data come in, you ADD 1 instead of APPENDING a string '1'
to the appropriate slot. THEN and only then do those relatively few
cells of the 3D matrix take up space for a new object.

Moreover with the operations you appear to need you don't need to make a
special null object, I think: just the integer 0 will do, and you will
not call len() at the end since the integer is already stored in the
cell. If you wanted to store more info in each cell or somehow keep
track more directly of what cells are non-empty, etc etc, then you would
go for a more complete Null Object DP. But for your problem as stated,
the following might suffice:

1. initialize your dictionary with:

Dict = {}
for s in Lst:
    Dict[s] = {}
    for t in nuerLst:
        Dict[s][t] = {}
        for r in nuestLst:
            Dict[s][t][r] = 0

2. update it on each incoming datum with:

for s in Lst:
    Dict[s[0]][s[1]][s[2]] += 1

3. consult it when done with:

Dict["Factory"]["date"]["WidgetType"] = #Widgets of some type
produced at a Factory on a given date.


Hope this helps -- if you do need a bit more, write about it here and
I'll happily show you a richer Null Object Design Pattern variant!


Alex
 

Alex Martelli

Jeremy Bowers said:
Based on these numbers, aren't the majority of your lists empty? In that
case, key off of a tuple as you need them. For some "factory", "date",
"widget" and "whatever" (the data, which you don't specify):

Dict.setdefault((factory, date, widget), []).append(whatever)

Well, he does specify a '1' as the datum, which is why I suggested
keeping the matrix dense (minimal change to his program) and just using
numbers instead of lists (he only needs the length of the lists, he
might as well be keeping the counts differently).

But, yes, indexing with the tuple would probably cut down memory
consumption further, since he's got no more than a million entries; each
entry may take a few tens of bytes (if it needs to uniquely keep a tuple
and the strings in it) so his memory consumption would plummet. Also,
he would save the long initialization phase entirely.

When indexing with a tuple the parentheses are not needed, and for the
"keeping counts" approach he should be updating the counts with:

Dict[factory, date, widget] = 1 + Dict.get((factory, date, widget), 0)

Also, Dict.get((factory, date, widget), 0) is what he would be using to
check his Dict at the end. It's no doubt worth making this prettier...:

class SparseMatrix(dict):
    def __getitem__(self, key): return self.get(key, 0)

so the update becomes:
Dict[factory, date, widget] += 1
and the check at the end just Dict[factory, date, widget].
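
A quick check that this behaves as described (the key values here are
placeholders):

class SparseMatrix(dict):
    def __getitem__(self, key): return self.get(key, 0)

Dict = SparseMatrix()
Dict["Factory_1", "2004-09-07", "widget_A"] += 1
Dict["Factory_1", "2004-09-07", "widget_A"] += 1

print(Dict["Factory_1", "2004-09-07", "widget_A"])   # -> 2
print(Dict["Factory_2", "2004-09-08", "widget_B"])   # -> 0, and nothing stored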

(You may actually benefit from changing the setdefault around to some
"try"-based scheme; you'd have to benchmark the various choices if it
matters to you. Also, depending on your data, consider Psyco.)

If you're trying to SAVE memory, do NOT, repeat NOT, consider Psyco,
which eats it for breakfast;-).
No extra memory. Some potentially significant slowdown due to tuple
construction and empty list construction.

Nah, he's saving 125 M operations at initialization, and doing at most
about 1M updates, no way this will slow things down. But then it
appears he doesn't need the lists.
(Depending on what you are doing, C++ and the STL can be not *that* much
harder to program in, and it should support hash tables with tuple keys,
although I don't know the magic incantations off the top of my head. If

I doubt they'll beat Python dicts, which are a wonder of speed.
you're just adding some numbers or something, this is also likely to be
enough faster that it may even be faster to develop in, since the program
will almost certainly execute much, much faster, potentially making the

I think you're wrong here. C++ in terms of standard only has
order-based containers, whose performance just can't compare to hash
tables. I do believe the STL, which has been evolving independently
from C++ for many years now, has added hash tables (and surely you can
find good ones on www.boost.org, you can find EVERYTHING there and if
you're programming in C++ it should be your second home).
Actually, you probably should go to a database now; these problems have a
habit of doubling or tripling in size when someone decides it'd be neat to
add Just One More attribute and you already pretty much can't afford that.)

With a tuple-indexed SparseMatrix of numbers, his program will be taking
up a tiny fraction of the memory it's burning now, and he will have room
to spare for enhancements. And adding an attribute will cost little,
say a few tens of megabytes if he has a million entries and the new
attribute is typically a string a few tens of bytes long.

Relational databases are cool because you can make impromptu queries
(if you speak SQL) and with clever indexing the cost of reading in the
dataset is only paid once. But for his problem as stated he may well be
happier staying in Python and just using the tricks suggested in these
posts.


Suppose he did need lists and mere counts would not do. A fuller Null
Object DP can help. E.g., consider:

In [5]: class Nullo:
   ...:     def __iadd__(self, wot): return wot
   ...:     def __len__(self): return 0
   ...:     def __iter__(self): return iter(())
   ...:

In [6]: na = Nullo()

In [7]: alist = [na] * 33

In [8]: alist[7] += [23]

In [9]: alist[7] += [45]

In [10]: alist[12] += [3]


In [12]: [(i,x) for i,x in enumerate(alist) if x]
Out[12]: [(7, [23, 45]), (12, [3])]


Some might consider this an abuse of __iadd__ (shouldn't it "always"
return self?!) but I consider it a case of "practicality beats purity".

Use one (ONE) instance of this Nullo class as the default object
returned by the above-shown class SparseMatrix, and you have a sparse
matrix of "lists" (most of them are that one Nullo instance, which
behaves, ducktyping-like, as an empty list, for the requested uses; the
non-empty ones are bona fide lists)...
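
Concretely, the combination might look like this (SparseListMatrix and
the 'serial ...' values are made-up names for the sketch):

class Nullo:
    def __iadd__(self, wot): return wot
    def __len__(self): return 0
    def __iter__(self): return iter(())

NULLO = Nullo()          # the single shared placeholder instance

class SparseListMatrix(dict):
    def __getitem__(self, key): return self.get(key, NULLO)

Dict = SparseListMatrix()
Dict["Factory_1", "2004-09-07", "widget_A"] += ['serial 23']
Dict["Factory_1", "2004-09-07", "widget_A"] += ['serial 45']

# Only the touched cell holds a real list; every other cell still
# "is" the shared Nullo, which behaves like an empty list.
print(Dict["Factory_1", "2004-09-07", "widget_A"])        # ['serial 23', 'serial 45']
print(len(Dict["Factory_2", "2004-09-08", "widget_B"]))   # 0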


Alex
 
