Memory Problem


Christoph Scheit

Hi,

I have a short script to read binary files from a numerical
simulation. These binary files still need some post-processing:
summing up results from different CPUs, filtering out invalid entries,
and bringing the data into a particular order.

Reading the binary data using the struct module works fine: I read
one chunk of data into a tuple and append this tuple to a list.
At the end of reading, I return the list.
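
For readers following along, the reading step described above might look roughly like the sketch below. The record layout (four little-endian 32-bit ints followed by one 32-bit float) is an assumption based on the row description given later in this thread, and `read_records` is a hypothetical helper, not the actual `BDBReader.readDB`.

```python
import io
import struct

# Assumed record layout: 4 little-endian int32 followed by 1 float32.
RECORD = struct.Struct("<4if")

def read_records(stream):
    """Read fixed-size records from a binary stream into a list of tuples."""
    rows = []
    while True:
        chunk = stream.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break  # end of file (or a truncated trailing record)
        rows.append(RECORD.unpack(chunk))
    return rows

# Usage with an in-memory stand-in for a simulation output file.
sample = RECORD.pack(1, 2, 3, 4, 0.5) + RECORD.pack(5, 6, 7, 8, 1.5)
records = read_records(io.BytesIO(sample))
```

In CPython, a list of tuples built this way typically costs much more per row than the 20 bytes stored on disk, because each tuple (and most of the numbers in it) is a separate Python object.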

Then the data is added to a table, which I use for the actual post-processing.
The table is actually a class with several "columns", each column internally
being represented by an array.
Now adding all the data from the simulation results to the table makes the
memory usage explode. So I would like to know where exactly the memory
is wasted.

Here is the code to add the data of one file (in total, I have to add the
data of several files to the same table):

# create reader
breader = BDBReader("<var>", "<type>", "#")

# read data
bData = breader.readDB(dbFileList[0])

# create table
dTab = DBTable(breader.headings, breader.converters, [1,2])
addRows(bData, dTab)

Before I add a new entry to the table, I check if there is already an entry
like this. To do so, I store keys for all the entries with row-number in a
dictionary. What about the memory consumption of the dictionary?

Here is the code for adding a new row to the table:

# check if data already exists
if key in self.keyDict:
    rowIdx = self.keyDict[key]
    # accumulate the new values into the mutable columns of the existing row
    for i in self.mutableCols:
        self.cols[i][rowIdx] += rowData[i]
    return

# key is still available - insert row to table
self.keyDict[key] = self.nRows

# insert data to the columns
for i in range(0, self.nCols):
    self.cols[i].add(rowData[i])

# add row i and increment number of rows
self.rows.append(DBRow(self, self.nRows))
self.nRows += 1

Maybe somebody can help me. If you need, I can give more implementation
details.

Thanks in advance,

Christoph
--

============================
M.Sc. Christoph Scheit
Institute of Fluid Mechanics
FAU Erlangen-Nuremberg
Cauerstrasse 4
D-91058 Erlangen
Phone: +49 9131 85 29508
============================
 

Marc 'BlackJack' Rintsch

Then the data is added to a table, which I use for the actual Post-Processing.
The table is actually a Class with several "Columns", each column internally
being represented by array.

Array or list?

Before I add a new entry to the table, I check if there is already an entry
like this. To do so, I store keys for all the entries with row-number in a
dictionary. What about the memory consumption of the dictionary?

The more items you put into the dictionary the more memory it uses. ;-)

Maybe somebody can help me. If you need, I can give more implementation
details.


IMHO That's not enough code and/or description of the data structure(s).
And you also left out some information like the number of rows/columns and
the size of the data.

Have you already thought about using a database?

Ciao,
Marc 'BlackJack' Rintsch
 

Christoph Scheit

Array or list?

array

More details:

class DBTable:
    # the class DBTable has a list, each list entry referencing a DBColumn object
    self.cols = []

    # the dictionary (key -> row number) is used to look up whether an entry
    # already exists
    self.dict = {}

class DBColumn:
    # has a name (string) and a datatype (int, float, e.g.) as attributes, plus
    self.data = array('f')  # an array of type float

I have to deal with several million data entries. Currently I'm trying an
example with 360 grid points and 10000 time steps, i.e. 3 600 000 entries
(each row consists of 4 ints and one float).

Of course, the more keys, the bigger the dictionary, but is there a way to
evaluate the actual size of the dictionary?

Greets and Thanks,

Chris
 

Gabriel Genellina

On Tue, 18 Sep 2007 10:58:42 -0300, Christoph Scheit wrote:

I have to deal with several millions of data, actually I'm trying an
example with 360 grid points and 10000 time steps, i.e. 3 600 000 entries
(and each row consists of 4 int and one float)

Of course, the more keys the bigger is the dictionary, but is there a
way to evaluate the actual size of the dictionary?

Yes, but you probably should not worry about it: it is just a few bytes per
entry.
Why don't you use an actual database? SQLite is fast, lightweight, and
comes with Python 2.5.
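
To make the database suggestion concrete, here is a minimal sketch using the standard `sqlite3` module. The table name, the column names and the choice of `(k1, k2)` as the key are assumptions made for illustration; the thread does not say which columns form the key. `add_row` is a hypothetical helper that mirrors the "add to the existing row or insert a new one" logic from the original post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real data
conn.execute(
    """CREATE TABLE results (
           k1 INTEGER, k2 INTEGER, k3 INTEGER, k4 INTEGER,
           value REAL,
           PRIMARY KEY (k1, k2)
       )"""
)

def add_row(conn, row):
    """Insert a row, or add its value to the existing row with the same key."""
    k1, k2, k3, k4, value = row
    cur = conn.execute(
        "UPDATE results SET value = value + ? WHERE k1 = ? AND k2 = ?",
        (value, k1, k2),
    )
    if cur.rowcount == 0:
        # no row with this key yet: insert a new one
        conn.execute(
            "INSERT INTO results VALUES (?, ?, ?, ?, ?)",
            (k1, k2, k3, k4, value),
        )

for row in [(1, 10, 0, 0, 0.5), (2, 10, 0, 0, 1.0), (1, 10, 0, 0, 0.25)]:
    add_row(conn, row)
conn.commit()

rows = conn.execute("SELECT k1, k2, value FROM results ORDER BY k1").fetchall()
```

With the data on disk, sorting and filtering can be left to `ORDER BY` and `WHERE` clauses instead of an in-memory index list.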

self.rows.append(DBRow(self, self.nRows))

This looks suspicious and may indicate that your structure contains
cycles. Python cannot always reclaim memory from those cycles, so you
end up using much more memory than needed.
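
The kind of cycle meant here can be reproduced with a small sketch. The two classes below are hypothetical stand-ins, not the poster's real `DBTable` and `DBRow`; the only thing they share with the original is that each row keeps a reference back to the table that contains it. The behaviour shown assumes CPython, where reference counting frees most objects and a separate cyclic collector handles the rest.

```python
import gc
import weakref

class DBRow:
    def __init__(self, table, index):
        self.table = table   # back-reference: row -> table
        self.index = index

class DBTable:
    def __init__(self):
        self.rows = []       # forward reference: table -> rows

    def add_row(self):
        self.rows.append(DBRow(self, len(self.rows)))

gc.disable()                 # keep the demonstration deterministic
table = DBTable()
table.add_row()
probe = weakref.ref(table)   # lets us observe whether the table is still alive

del table
alive_after_del = probe() is not None      # reference counting alone cannot free it
gc.collect()                               # the cyclic collector can
alive_after_collect = probe() is not None
gc.enable()
```

In older Python versions, cycles whose objects define `__del__` were not freed at all and ended up in `gc.garbage`; since Python 3.4 they are collected as well.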
 

Bruno Desthuilliers

Christoph Scheit wrote:
(snip)

I have to deal with several millions of data, actually I'm trying an example
with 360 grid points and 10000 time steps, i.e. 3 600 000 entries (and each
row consists of 4 int and one float)

Hem... May I suggest that you use a database then? If you don't want to
bother with a full-blown RDBMS, then have a look at SQLite - it's
lightweight, works mostly fine and is a no-brainer to use.

Of course, the more keys the bigger is the dictionary, but is there a way to
evaluate the actual size of the dictionary?

You can refer to the thread "creating really big lists" for a Q&D, raw
approx of such an evaluation. But it's way too big anyway to even
consider storing all this in ram.
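
A quick and dirty estimate of this kind could look like the sketch below. It relies on `sys.getsizeof`, which exists only from Python 2.6 on, so it assumes a newer interpreter than the one discussed in the thread. The key shape (a pair of ints per row) and the helper name `dict_footprint` are assumptions for illustration.

```python
import sys

def dict_footprint(d):
    """Rough size in bytes of a dict plus its keys and values (shallow)."""
    total = sys.getsizeof(d)
    for key, value in d.items():
        total += sys.getsizeof(key) + sys.getsizeof(value)
        if isinstance(key, tuple):
            # count the elements of tuple keys as well
            total += sum(sys.getsizeof(item) for item in key)
    return total

# Hypothetical keys: (grid point, time step) pairs mapped to row numbers.
key_dict = {(point, step): row
            for row, (point, step) in enumerate(
                (p, s) for p in range(360) for s in range(100))}

table_only = sys.getsizeof(key_dict)     # the hash table itself
with_contents = dict_footprint(key_dict)  # plus keys and values
print(table_only, with_contents, with_contents / len(key_dict))
```

The result is an approximation: integers shared between several keys are counted more than once, and allocator overhead is not counted at all. Multiplying the per-entry figure by the real number of rows gives an idea of the order of magnitude.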
 

Christoph Scheit

Hi, Thank you all very much,

so I will consider using a database. Anyway, I would like to know
how to detect cycles, if there are any.

This looks suspicious, and may indicate that your structure contains
cycles, and Python cannot always reclaim memory from those cycles, and you
end up using much more memory than needed.

How can I detect if there are cycles?

self.rows is a list containing DBRow objects,
each itself being an integer pointer (index) to the i-th row.
I'm using this list in order to sort the table by sorting the index list
instead of really sorting the entries (or to filter).

 

Gabriel Genellina

On Tue, 18 Sep 2007 12:24:46 -0300, Christoph Scheit wrote:

How can I detect if there are cycles?

Analyzing your code, or maybe inspecting gc.garbage, or looking at
sys.getrefcount(x)
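
As a small illustration of the `gc.garbage` route, the sketch below asks the collector to keep everything it finds unreachable instead of freeing it, so the objects can be inspected afterwards. `Node` is a hypothetical class used only to produce a cycle; in real code you would look for instances of your own classes in `gc.garbage`.

```python
import gc

class Node:
    """Hypothetical stand-in for an object that ends up in a cycle."""
    def __init__(self):
        self.me = self   # simplest possible reference cycle

gc.collect()                       # start from a clean state
gc.set_debug(gc.DEBUG_SAVEALL)     # keep unreachable objects in gc.garbage
Node()                             # create a cycle and drop it immediately
unreachable = gc.collect()         # number of unreachable objects found
cyclic_nodes = [obj for obj in gc.garbage if isinstance(obj, Node)]
gc.set_debug(0)                    # restore normal behaviour
gc.garbage.clear()
```

If objects of your own classes show up in that list, something is holding them in a cycle; `gc.get_referrers(obj)` can then help to find out what refers to them.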

self.rows is a list containing DBRow objects,
each itself being an integer pointer (index) to the i-th row.
I'm using this list in order to sort the table by sorting the index list
instead of really sorting the entries (or to filter).

What looks strange is the "self" argument to DBRow, since the items are
already contained in self.rows.
But it's hard to tell anything more without looking at your code.
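
One possible way to keep the sort-by-index idea without passing the table to every row is to store plain integers instead of row objects. This is only a sketch under assumptions: the column names and data are made up, and the real `DBTable` may need more than this.

```python
from array import array

# Hypothetical column data; in the thread each column is an array.
times = array("f", [3.0, 1.0, 2.0])
values = array("f", [30.0, 10.0, 20.0])

# Plain integers instead of DBRow objects: no back-reference to the table,
# hence no reference cycle and far less per-row overhead.
order = list(range(len(times)))

# Sort the index list by the content of one column; the columns stay in place.
order.sort(key=times.__getitem__)

# Filtering works the same way: keep only indices whose row passes a test.
valid = [i for i in order if values[i] >= 20.0]

sorted_values = [values[i] for i in order]
```

If row objects are still wanted for convenience, they can be created on demand when a row is accessed rather than stored for all rows.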
 
