Large Two Dimensional Array

Ayushi Dalmia

Hello,

I am trying to implement IBM Model 1. For that I need to create a matrix of 50000*50000 with double values. Currently I am using a dict of dicts, but it is unable to support such high dimensions and hence gives a memory error. Any help in this regard will be useful. I understand that I cannot store the matrix in RAM, but what is the most efficient way to do this?

Also, the matrix indices are words and not integers. I do not want to map them to integers.
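
For scale, a quick back-of-the-envelope check (plain Python, using only the figures from the question) shows why the dense matrix cannot fit in typical RAM, even before counting the per-object overhead of Python dicts:

    cells = 50000 * 50000      # 2.5 billion entries
    raw_bytes = cells * 8      # 8 bytes per double
    print(raw_bytes / 1e9)     # -> 20.0 GB, before any dict overhead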
 
David Froger

Hello,

I would suggest using h5py [1] or PyTables [2] to store data on disk (both are
based on HDF5 [3]), and manipulate data in RAM as NumPy [4] arrays.

[1] www.h5py.org
[2] www.pytables.org
[3] www.hdfgroup.org/HDF5
[4] www.numpy.org
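
A minimal sketch of the h5py route David suggests (the file name, dataset name, and chunk shape are illustrative choices, not anything mandated by h5py):

    import numpy as np
    import h5py

    # Create a 50000 x 50000 float64 dataset backed by a file on disk.
    # HDF5 stores the data in chunks, so only the blocks you actually
    # write take up space in the file.
    with h5py.File('ibm_model1.h5', 'w') as f:
        t = f.create_dataset('t', shape=(50000, 50000), dtype='f8',
                             chunks=(1000, 1000))
        t[0, :] = np.random.rand(50000)   # write one full row

    # Later, read a block back into RAM as an ordinary NumPy array:
    with h5py.File('ibm_model1.h5', 'r') as f:
        block = f['t'][0:1000, 0:1000]    # 1000 x 1000 array, ~8 MB

Note that HDF5 datasets are indexed by integers, so the word-based indexing from the question would still need a word-to-index dict on top of this.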
 
Ayushi Dalmia

Thanks David!
 
Denis McMahon


This looks to me like a table with columns:

word1 (varchar 20) | word2 (varchar 20) | connection (double)

might be your best solution, but it's going to be a huge table (2.5 billion rows).

The primary key is going to be the combination of word1 and word2 (putting
the connection column in the key as well would allow duplicate word pairs),
and you want indexes on word1 and word2. The indexes will slow down
populating the table but speed up searching it, and I assume that searching
is going to be a much more frequent operation than populating.

Also, creating a database has the additional advantage that next time you
want to use the program for a conversion between two languages that
you've previously built the data for, the data already exists in the
database, so you don't need to build it again.

I imagine you would have either one table for each language pair, or one
table for each conversion (treating a->b and b->a as two separate
conversions).

I'm also guessing that varchar 20 is long enough to hold any of your
50,000 words in either language; that value might need adjusting
otherwise.
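
A sketch of that table using Python's built-in sqlite3 module (SQLite is just one convenient choice; the table name t and the sample row are made up for illustration):

    import sqlite3

    conn = sqlite3.connect('model1.db')
    cur = conn.cursor()

    # (word1, word2) is the primary key; its index also covers lookups
    # by word1 alone, so only word2 needs a separate index.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS t (
            word1      VARCHAR(20) NOT NULL,
            word2      VARCHAR(20) NOT NULL,
            connection DOUBLE      NOT NULL,
            PRIMARY KEY (word1, word2)
        )
    """)
    cur.execute("CREATE INDEX IF NOT EXISTS idx_word2 ON t (word2)")

    # Insert or update one cell of the "matrix":
    cur.execute("INSERT OR REPLACE INTO t VALUES (?, ?, ?)",
                ('house', 'maison', 0.42))
    conn.commit()

    # Look a cell up by its word pair:
    cur.execute("SELECT connection FROM t WHERE word1 = ? AND word2 = ?",
                ('house', 'maison'))
    print(cur.fetchone())   # -> (0.42,)
    conn.close()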
 
